Python Basics for Machine Learning




Commonly Used Development Environments




The math module




Opening and Editing Files


  • .open

    A file object is created by the built-in function 'open'.

    >>> f = open('test.txt')      # default: read-only mode
  • .read

    Reads the whole file (or at most N bytes, if N is given) and returns it as a single string.

  • .readline

    Reads one line and then moves on to the next line.

    >>> line = f.readline()       # read one line
    >>> line
    'This is the first line.\n'
  • .readlines

    Reads all lines and returns them as a list of strings.

    >>> lines = f.readlines()     # read all remaining lines
    >>> lines
    ['This is the second.\n', 'And third.\n']
  • .write

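    write() simply outputs the given string; it does not append a newline.
    In text mode (the default) the argument is a str, which does not have to be ASCII. To write bytes, open the file in binary mode ('wb').

    >>> w = open('output.txt', 'w')         # write mode (text by default)
    >>> w.write('stuff')                    # no newline is added automatically
    >>> w.write('\n')
    >>> w.write('more\n and even more\n')
    >>> w.close()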
  • .close

    Closes the file and flushes buffered data to disk. After the calls above, output.txt contains:

    stuff
    more
     and even more

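  • Reading with a for loop

    • Note: each line ends with a trailing newline '\n'.

    • Use the string method 'strip' or 'rstrip' to remove it.

    infile = open('test.txt')               # read-only mode
    outfile = open('test_upper.txt', 'w')   # write mode; creates the file

    for line in infile:
        outfile.write(line.upper())

    infile.close()    # not strictly required for reading, but good practice
    outfile.close()   # makes sure buffered output is actually written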
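The same loop can be written with a with statement, which closes both files even if an error occurs part-way through. This is only a minimal sketch: the helper name upper_copy and the temporary file locations are assumptions made for illustration, not part of the original notes.

```python
import os
import tempfile

def upper_copy(src_path, dst_path):
    """Copy src_path to dst_path with every line upper-cased."""
    # Both files are closed automatically when the with block ends.
    with open(src_path, encoding="utf-8") as infile, \
         open(dst_path, "w", encoding="utf-8") as outfile:
        for line in infile:
            # rstrip removes the trailing newline; add it back explicitly.
            outfile.write(line.rstrip("\n").upper() + "\n")

# Build a small input file in a temporary directory.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "test.txt")
dst = os.path.join(workdir, "test_upper.txt")
with open(src, "w", encoding="utf-8") as f:
    f.write("This is the first line.\nThis is the second.\n")

upper_copy(src, dst)
with open(dst, encoding="utf-8") as f:
    result = f.read()
print(result)
```

With this form there is no need to remember the close() calls at all.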
  • NumPy: np.savetxt()

    savetxt() saves an array to a text file:

    >>> a = np.linspace(0, 1, 12); a.shape = (3, 4); a
    array([[0.        , 0.09090909, 0.18181818, 0.27272727],
           [0.36363636, 0.45454545, 0.54545455, 0.63636364],
           [0.72727273, 0.81818182, 0.90909091, 1.        ]])
    >>> np.savetxt("myfile.txt", a)

  • NumPy: np.save()

    save() stores an array in a binary file in NumPy's ".npy" format:

    >>> np.save("myfile", a)

    This creates a binary file myfile.npy containing the array a, which can later be read back into memory with np.load().

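  • NumPy: np.loadtxt()

    loadtxt() reads an array stored in a text file into memory.
    By default it assumes that columns are separated by whitespace and that lines starting with '#' are comments. Optional arguments such as delimiter change these assumptions.

    Example text file data.txt:

    # Year Min temp. Max temp.
    1990 -1.5 25.3
    1991 -3.2 21.2

    >>> table = np.loadtxt("data.txt")
    >>> table
    array([[1.99000000e+03, -1.50000000e+00, 2.53000000e+01],
           [1.99100000e+03, -3.20000000e+00, 2.12000000e+01]])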
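A round trip through both formats can be sketched as follows. The temporary directory and the comma delimiter are assumptions chosen for the example; delimiter is the optional argument that replaces the default whitespace separator.

```python
import os
import tempfile
import numpy as np

a = np.linspace(0, 1, 12).reshape(3, 4)

workdir = tempfile.mkdtemp()
txt_path = os.path.join(workdir, "myfile.txt")
npy_path = os.path.join(workdir, "myfile.npy")

# Text format: human-readable; here written and read with a comma separator.
np.savetxt(txt_path, a, delimiter=",")
from_txt = np.loadtxt(txt_path, delimiter=",")

# Binary .npy format: keeps dtype and shape exactly.
np.save(npy_path, a)
from_npy = np.load(npy_path)

print(from_txt.shape, from_npy.dtype)
```

The text file is easier to inspect by hand, while the .npy file is the safer choice when the exact dtype and values must be preserved.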


Main Packages: NumPy, Matplotlib, Pandas, ...


1st. NumPy (Numerical Python):
  1. Importing

    import numpy as np
  2. Brief introduction

    >>> np.array([2, 3, 6, 7])
    array([2, 3, 6, 7])
    >>> np.array([2, 3, 6, 7.])
    array([2., 3., 6., 7.])
    >>> np.array([2, 3, 6, 7+1j])
    array([2.+0.j, 3.+0.j, 6.+0.j, 7.+1.j])

    Often, the elements of an array are originally unknown, but its size is known. Hence, NumPy offers several functions to create arrays with initial placeholder content. These minimize the necessity of growing arrays, an expensive operation.

    NumPy’s array class is called ndarray. It is also known by the alias array. Note that numpy.array is not the same as the Standard Python Library class array.array, which only handles one-dimensional arrays and offers less functionality. The more important attributes of an ndarray object are:

    • ndarray.ndim

      the number of axes (dimensions) of the array.

    • ndarray.shape

      the dimensions of the array. This is a tuple of integers indicating the size of the array in each dimension. For a matrix with n rows and m columns, shape will be (n,m). The length of the shape tuple is therefore the number of axes, ndim.

    • ndarray.size

      the total number of elements of the array. This is equal to the product of the elements of shape.

    • ndarray.dtype

      an object describing the type of the elements in the array. One can create or specify dtypes using standard Python types. Additionally, NumPy provides types of its own; numpy.int32, numpy.int16, and numpy.float64 are some examples.

    • ndarray.itemsize

      the size in bytes of each element of the array. For example, an array of elements of type float64 has itemsize 8 (=64/8), while one of type int32 has itemsize 4 (=32/8). It is equivalent to ndarray.dtype.itemsize.

    • ndarray.data

      the buffer containing the actual elements of the array. Normally, we won't need to use this attribute because we will access the elements in an array using indexing facilities.

      >>> import numpy as np
      >>> a = np.arange(15).reshape(3, 5)
      >>> a
      array([[ 0,  1,  2,  3,  4],
             [ 5,  6,  7,  8,  9],
             [10, 11, 12, 13, 14]])
      >>> a.shape          # number of rows, columns, and so on
      (3, 5)
      >>> a.ndim           # number of dimensions
      2
      >>> a.dtype.name     # platform dependent; may be 'int32' on some systems
      'int64'
      >>> a.itemsize
      8
      >>> a.size           # number of elements
      15
      >>> type(a)
      <class 'numpy.ndarray'>
      >>> b = np.array([6, 7, 8])
      >>> b
      array([6, 7, 8])
      >>> type(b)
      <class 'numpy.ndarray'>
  3. A frequent error

    A frequent error consists in calling array with multiple arguments, rather than providing a single sequence as an argument.

    >>> a = np.array(1,2,3,4)    # WRONG
    Traceback (most recent call last):
    ...
    TypeError: array() takes from 1 to 2 positional arguments but 4 were given
    >>> a = np.array([1,2,3,4]) # RIGHT
  4. Notes from the lecture slides

    • arange([start,] stop[, step,], dtype=None)

      >>> np.arange(5)
      array([0, 1, 2, 3, 4])
      >>> np.arange(10, 100, 20, dtype=float)
      array([10., 30., 50., 70., 90.])

      # Reshaping
      a = np.arange(0, 20, 1)    # 1-D array
      b = a.reshape((4, 5))      # 4 rows, 5 columns
      c = a.reshape((20, 1))     # 2-D: 20 rows, 1 column
      d = a.reshape((-1, 4))     # -1: the number of rows is inferred (5 here)
      a.shape = (4, 5)           # reshape a in place
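A short sketch of what reshape allows and what it rejects: the total number of elements must stay the same, and -1 lets NumPy infer one dimension. The variable names below are chosen only for this illustration.

```python
import numpy as np

a = np.arange(20)          # 20 elements, shape (20,)
b = a.reshape(4, 5)        # 4 * 5 == 20, so this works
d = a.reshape(-1, 4)       # -1 is inferred as 20 / 4 == 5 rows

# 3 * 7 == 21 does not match 20 elements, so NumPy raises ValueError.
try:
    a.reshape(3, 7)
    reshape_error = None
except ValueError as exc:
    reshape_error = str(exc)

print(b.shape, d.shape)
print("error:", reshape_error)
```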

    • .linspace(start, stop, num=50, endpoint=True, retstep=False, dtype=None)

      >>> np.linspace(0., 2.5, 5)
      array([0.   , 0.625, 1.25 , 1.875, 2.5  ])

      >>> from numpy import pi
      >>> x = np.linspace(0, 2*pi, 100)   # useful for evaluating a function at many points
      >>> f = np.sin(x)

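    • Shapes (N,), (N, 1) and (1, N) are different

      Shape (N,): the array is 1-D.
      Shape (N, 1): the array is 2-D, with N rows and one column.
      Shape (1, N): the array is 2-D, with one row and N columns.

      import numpy as np

      a = np.array([1, 2, 3, 4, 5])       # 1-D array
      b = a.copy()

      c1 = np.dot(np.transpose(a), b)     # transposing a 1-D array has no effect
      print(c1)                           # 55 (inner product)
      c2 = np.dot(a, np.transpose(b))     # the transpose can also be written b.T
      print(c2)                           # 55

      ax = np.reshape(a, (5, 1))
      bx = np.reshape(b, (1, 5))
      c = np.dot(ax, bx)                  # (5, 1) times (1, 5) gives a 5x5 outer product
      print(c)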
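The difference between the three shapes can also be checked directly. In this sketch, np.newaxis is used as an alternative to np.reshape for turning a 1-D array into a column or a row; the variable names are illustrative.

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5])

col = a[:, np.newaxis]     # shape (5, 1): a column
row = a[np.newaxis, :]     # shape (1, 5): a row

inner = a @ a              # 1-D with 1-D: a scalar inner product
outer = col @ row          # (5, 1) with (1, 5): a 5x5 matrix

print(a.T.shape, col.shape, row.shape)
print(inner)
print(outer.shape)
```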

    • Filling arrays with a constant or with random numbers

      1. .zeros

      2. .ones

      3. .full

        >>>np.zeros(3)
        array([0., 0., 0.])

        >>>np.zeros((2, 2), complex)
        array([[0.+0.j, 0.+0.j],
        [0.+0.j, 0.+0.j]])

        >>>np.ones((2, 3))
        array([[1., 1., 1.],
        [1., 1., 1.]])

        >>>np.full((2,2), 7)
        array([[7, 7],
        [7, 7]])
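      4. .rand: uniformly distributed random numbers in [0, 1)

      5. .randn: random numbers from the standard normal (Gaussian) distribution, with mean 0 and variance 1

      6. etc. (random numbers from other standard probability distributions are also available)
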
        np.random.rand(2, 4)
        array([[ 0.94672374, 0.0383632 , 0.12738539, 0.21592466],
        [ 0.49394559, 0.2216863 , 0.3053351 , 0.51381235]])

        np.random.randn(2, 4)
        array([[ 1.05383548, -1.2142876 , -0.83458293, 0.53291161],
        [ 0.08311765, 0.14007751, -0.06647882, 1.09115942]])
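Recent NumPy versions also provide a Generator interface through np.random.default_rng, which the NumPy documentation recommends for new code. The sketch below shows the rough equivalents of rand and randn; the seed value 0 and the array sizes are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(seed=0)       # seeded for reproducibility

uniform = rng.random((2, 4))              # like np.random.rand(2, 4): values in [0, 1)
normal = rng.standard_normal((2, 4))      # like np.random.randn(2, 4)
dice = rng.integers(1, 7, size=10)        # integers from 1 to 6 (upper bound excluded)

# A second generator with the same seed reproduces the same numbers.
rng_again = np.random.default_rng(seed=0)
same_uniform = rng_again.random((2, 4))

print(uniform.shape, normal.shape, dice)
```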

    • 1-D array indexing and slicing:
      The form [start:stop] extracts a segment of an array, from position start up to but not including stop.

      >>>a = np.array([0, 1, 2, 3, 4])

      >>>a[1:3]
      array([1, 2])

      >>>a[:3]
      array([0, 1, 2])

      >>>a[1:]
      array([1, 2, 3, 4])

      >>>a[1:-1]
      array([1, 2, 3])
      1. The whole array: a or a[:]

        >>>a = np.array([0, 1, 2, 3, 4])
        >>>a[:]
        array([0, 1, 2, 3, 4])

      2. To take elements at regular intervals, give a third number (the step) after the second colon:

        >>>a[::2]
        array([0, 2, 4])
        >>>a[1:4:2]
        array([1, 3])

      3. A step of -1 reverses an array:

        >>>a[::-1]
        array([4, 3, 2, 1, 0])

    • 2-D array indexing

      1. A multidimensional array is indexed with a tuple of integers:

        >>> a = np.arange(12); a.shape = (3, 4); a
        array([[ 0,  1,  2,  3],
               [ 4,  5,  6,  7],
               [ 8,  9, 10, 11]])

        >>>a[1, 2]
        6
        >>>a[1, -1]
        7

      2. 2-D array slicing: single rows and single columns, similar to lists

        >>> a = np.arange(12); a.shape = (3, 4); a
        array([[ 0,  1,  2,  3],
               [ 4,  5,  6,  7],
               [ 8,  9, 10, 11]])
        >>>a[:, 1]
        array([1, 5, 9])
        >>>a[2, :]
        array([8, 9, 10, 11])
        >>>a[1][2]
        6
        >>>a[2]
        array([8, 9, 10, 11])
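Slices can be combined on both axes to cut out a sub-block. The following sketch uses the same 3x4 array; the variable names are illustrative. Note the difference between an integer index, which drops the axis, and a length-one slice, which keeps it.

```python
import numpy as np

a = np.arange(12).reshape(3, 4)

block = a[0:2, 1:3]        # rows 0-1, columns 1-2
column_1d = a[:, 1]        # integer index: result is 1-D, shape (3,)
column_2d = a[:, 1:2]      # slice: result stays 2-D, shape (3, 1)
every_other = a[::2, ::2]  # every second row and every second column

print(block)
print(column_1d.shape, column_2d.shape)
print(every_other)
```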
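    • Copies and views:

      • A slice of a standard Python list is a copy.

      • A slice of a NumPy array is a view onto the array. The slice and the original array refer to the same block of memory, so changing the contents of the view also changes the original array:
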
        >>>a = np.arange(5); a
        array([0, 1, 2, 3, 4])
        >>>b = a[2:]; b
        array([2, 3, 4])
        >>>b[0] = 100
        >>>b
        array([100, 3, 4])
        >>>a
        array([0, 1, 100, 3, 4])

      • To avoid changing the original array, copy the slice:

        >>>a = np.arange(5); a
        array([0, 1, 2, 3, 4])
        >>>b = a[2:].copy(); b
        array([2, 3, 4])
        >>>b[0] = 100
        >>>b
        array([100, 3, 4])
        >>>a
        array([0, 1, 2, 3, 4])
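Whether two arrays refer to the same memory can be checked with np.shares_memory. The sketch below contrasts a list slice, an array view, and an explicit copy; all variable names are illustrative.

```python
import numpy as np

# List slice: an independent copy.
lst = [0, 1, 2, 3, 4]
lst_slice = lst[2:]
lst_slice[0] = 100         # lst is unchanged

# Array slice: a view on the same memory.
arr = np.arange(5)
view = arr[2:]
copied = arr[2:].copy()    # explicit copy, taken before any modification

view_shares = np.shares_memory(arr, view)
copy_shares = np.shares_memory(arr, copied)

view[0] = 100              # also changes arr[2]
copied[1] = -1             # does not touch arr

print(lst, arr, view_shares, copy_shares)
```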

    • Array arithmetic:

      Basic arithmetic operations act on arrays element by element.

      import numpy as np

      x = np.array([[1,2],[3,4]], dtype=np.float64)
      y = np.array([[5,6],[7,8]], dtype=np.float64)

      print(x + y)
      print(np.add(x, y))

      print(x - y)
      print(np.subtract(x, y))

      print(x * y)
      print(np.multiply(x, y))

      print(x / y)
      print(np.divide(x, y))

      print(np.sqrt(x))

    • Matrix multiplication

      Matrix multiplication is done with the dot function (the * operator is element-wise):

      >>>A = np.array([[1, 2], [3, 4]])
      >>>np.dot(A, A)
      array([[7, 10],
      [15, 22]])

      The dot function also handles matrix-vector products:

      >>>A
      array([[1, 2],
      [3, 4]])
      >>>x = np.array([10, 20])
      >>>np.dot(A, x)   # equivalent to A.dot(x)
      array([50, 110])
      >>>np.dot(x, A)   # equivalent to x.dot(A)
      array([70, 100])
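The @ operator is an equivalent way to write these products for 1-D and 2-D arrays, and it makes the contrast with element-wise * easy to see. A minimal sketch with the same A and x:

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
x = np.array([10, 20])

elementwise = A * A        # multiplies matching entries
matmul = A @ A             # matrix product, same as np.dot(A, A)
Ax = A @ x                 # matrix times vector
xA = x @ A                 # vector times matrix

print(elementwise)
print(matmul)
print(Ax, xA)
```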

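    • More efficient math functions

      NumPy includes many common mathematical functions, for example:
      np.log, np.maximum, np.sin, np.exp, np.abs and so on (see https://docs.scipy.org/doc/numpy/reference/routines.math.html).
      These functions operate on whole arrays at once, so they are usually faster than looping over the corresponding functions in the math module, especially on large data.
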
      x = np.array([[1,2],[3,4]])

      print(np.sum(x)) # Compute sum of all elements;
      print(np.sum(x, axis=0)) # Compute sum of each column;
      print(np.sum(x, axis=1)) # Compute sum of each row;
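The sketch below puts a math-module loop next to the vectorized NumPy call and confirms that both give the same values; it also spells out what the axis argument of np.sum selects. The array sizes are arbitrary, and no timing is measured here.

```python
import math
import numpy as np

values = np.linspace(0.0, 10.0, 1001)

# math.sqrt works on one number at a time, so a Python loop is needed.
looped = np.array([math.sqrt(v) for v in values])
# np.sqrt processes the whole array in a single call.
vectorized = np.sqrt(values)

x = np.array([[1, 2], [3, 4]])
total = np.sum(x)                 # all elements
column_sums = np.sum(x, axis=0)   # collapse the rows: one sum per column
row_sums = np.sum(x, axis=1)      # collapse the columns: one sum per row

print(np.allclose(looped, vectorized))
print(total, column_sums, row_sums)
```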

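2nd. Matplotlib:

Matplotlib is one of the most widely used visualization tools in Python. It makes it easy to create a very wide range of 2D charts and some basic 3D charts.
Its function design was modeled on MATLAB, hence the name Matplotlib.
It was originally developed to visualize electrocorticography (ECoG) signals from epilepsy patients, and the paper describing it was published in 2007. The original author, Dr. John D. Hunter, was a neurobiologist.

  1. The simplest chart
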
    import matplotlib.pyplot as plt

    plt.plot([1,2,3,4], [1,4,9,16], 'ro')
    plt.axis([0, 6, 0, 20])
    plt.show()
  2. Multiple functions in one chart (1)

    import numpy as np
    import matplotlib.pyplot as plt

    t = np.arange(0., 5., 0.2)
    plt.plot(t, t, 'r--', t, t**2, 'bs', t, t**3, 'g^')
    plt.show()
  3. Setting line properties

    # x, y, x1, y1, x2, y2 below stand for data arrays defined beforehand.

    # Using keyword arguments:
    plt.plot(x, y, linewidth=2.0)

    # Using the setter methods of the Line2D object:
    line, = plt.plot(x, y, '-')
    line.set_antialiased(False) # turn off antialiasing

    # Using the setp() command:
    lines = plt.plot(x1, y1, x2, y2)
    # use keyword args
    plt.setp(lines, color='r', linewidth=2.0)
    # or MATLAB style string value pairs
    plt.setp(lines, 'color', 'r', 'linewidth', 2.0)

  4. Multiple functions in one chart (2)

    import numpy as np
    import matplotlib.pyplot as plt

    # Compute the x and y coordinates for points on sine and cosine curves
    x = np.arange(0, 3 * np.pi, 0.1)
    y_sin = np.sin(x)
    y_cos = np.cos(x)

    # Plot the points using matplotlib
    plt.plot(x, y_sin)
    plt.plot(x, y_cos)
    plt.xlabel('x axis label')
    plt.ylabel('y axis label')
    plt.title('Sine and Cosine')
    plt.legend(['Sine', 'Cosine'])
    plt.show()
  5. Adding text

    import numpy as np
    import matplotlib.pyplot as plt

    # Fixing random state for reproducibility
    np.random.seed(19680801)

    mu, sigma = 100, 15
    x = mu + sigma * np.random.randn(10000)

    # the histogram of the data
    n, bins, patches = plt.hist(x, 50, density=True, facecolor='g', alpha=0.75)  # 'normed' was removed from Matplotlib; 'density' replaces it

    plt.xlabel('Smarts')
    plt.ylabel('Probability')
    plt.title('Histogram of IQ')
    plt.text(60, .025, r'$\mu=100,\ \sigma=15$')
    plt.axis([40, 160, 0, 0.03])
    plt.grid(True)
    plt.show()
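Current Matplotlib releases no longer accept the normed argument of hist; density=True is the replacement and scales the bars so that the total area is 1. The sketch below checks this numerically. The Agg backend is an assumption made so the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (assumption: no display available)
import matplotlib.pyplot as plt
import numpy as np

np.random.seed(19680801)
x = 100 + 15 * np.random.randn(10000)

# density=True normalizes the histogram to a probability density.
n, bins, patches = plt.hist(x, 50, density=True, facecolor="g", alpha=0.75)

# Area = sum of (bar height * bar width); it should be 1 for a density.
area = float(np.sum(n * np.diff(bins)))
print(area)

plt.close("all")
```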
  6. Adding an annotation

    import numpy as np
    import matplotlib.pyplot as plt

    ax = plt.subplot(111)

    t = np.arange(0.0, 5.0, 0.01)
    s = np.cos(2*np.pi*t)
    line, = plt.plot(t, s, lw=2)

    plt.annotate('local max', xy=(2, 1), xytext=(3, 1.5),
                 arrowprops=dict(facecolor='black', shrink=0.05),
                 )

    plt.ylim(-2, 2)
    plt.show()
  7. Multiple charts: subplots

    import numpy as np
    import matplotlib.pyplot as plt

    def f(t):
        return np.exp(-t) * np.cos(2*np.pi*t)

    t1 = np.arange(0.0, 5.0, 0.1)
    t2 = np.arange(0.0, 5.0, 0.02)

    plt.figure(1)
    plt.subplot(211) # 2 rows, 1 column, index = 1
    plt.plot(t1, f(t1), 'bo', t2, f(t2), 'k')

    plt.subplot(212) # 2 rows, 1 column, index = 2
    plt.plot(t2, np.cos(2*np.pi*t2), 'r--')
    plt.show()
  8. Displaying an image

    import matplotlib.pyplot as plt

    plt.figure('A Little White Dog')
    little_dog_img = plt.imread('little_white_dog.jpg')
    plt.imshow(little_dog_img)
    plt.show()
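3rd. Pandas:

Pandas is a data analysis package for Python. Development began at AQR Capital Management in April 2008, and it was open-sourced at the end of 2009.

  1. Import conventions:

    Series and DataFrame are used so often that it is convenient to import them into the local namespace.

    >>> from pandas import Series, DataFrame
    >>> import pandas as pd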
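  2. Common data structures

    1. Series:

      A one-dimensional labeled array, made up of a sequence of values (of any NumPy data type) and an associated sequence of data labels (the index).
      It resembles a 1-D NumPy array or a Python list. Unlike a list, an array or a Series stores elements of a single type.

      1. Creating a Series: passing a list

        • Default integer index

        >>> obj = Series([4, 7, -5, 3])
        >>> obj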
        0 4
        1 7
        2 -5
        3 3
        dtype: int64

        >>>obj.values
        array([ 4, 7, -5, 3], dtype=int64)

        >>>obj.index
        RangeIndex(start=0, stop=4, step=1)

        • Explicit index

        >>> obj2 = Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
        >>> obj2
        d 4
        b 7
        a -5
        c 3
        dtype: int64

        >>> obj2.index
        Index(['d', 'b', 'a', 'c'], dtype='object')

      2. Accessing elements of a Series

        Index labels select a single value or a set of values:

        >>>obj2['a']
        -5

        >>>obj2['d']= 6

        >>>obj2[['c','a','d']]
        c 3
        a -5
        d 6
        dtype: int64

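      3. Operations on a Series

        NumPy array operations, such as filtering with a boolean array, scalar multiplication, or applying math functions, preserve the link between index and values.
        A Series can also be seen as a fixed-length, ordered dict, since it maps index values to data values. It can be used in many functions that expect a dict argument.
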
        >>>obj2[obj2 > 0]
        d 4
        b 7
        c 3
        dtype: int64
        >>>obj2*2
        d 8
        b 14
        a -10
        c 6
        dtype: int64
        >>>np.exp(obj2)
        d 54.598150
        b 1096.633158
        a 0.006738
        c 20.085537
        dtype: float64

        >>> 'b' in obj2
        True
        >>> 'e' in obj2
        False
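The dict-like and array-like sides of a Series can be combined, as in this sketch. It rebuilds obj2 so that it runs on its own; get and to_dict are standard Series methods, and the default value 0 is an arbitrary choice for the example.

```python
import pandas as pd

obj2 = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])

# Array-like: boolean filtering and arithmetic keep the labels attached.
positive = obj2[obj2 > 0]
doubled = obj2 * 2

# Dict-like: membership test, lookup with a default, conversion to a dict.
has_b = "b" in obj2
missing = obj2.get("e", default=0)
as_dict = obj2.to_dict()

print(positive)
print(doubled["a"], has_b, missing)
```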
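      4. Creating a Series: passing a dict

        • If the data is stored in a Python dict, a Series can be created from it directly.
        • If only a dict is passed, the index of the resulting Series consists of the dict's keys. Recent pandas versions keep the dict's insertion order; older versions sorted the keys.

        >>>sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
        >>>obj3 = Series(sdata)
        >>>obj3
        Ohio 35000
        Texas 71000
        Oregon 16000
        Utah 5000
        dtype: int64

        • In the next example, the 3 values in sdata that match the states index are found and placed in the corresponding positions. No value is found in sdata for "California", so its result is NaN (Not a Number).
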
        >>>states = ['California', 'Ohio', 'Oregon', 'Texas']
        >>>obj4 = Series(sdata, index=states)
        >>>obj4
        California NaN
        Ohio 35000
        Oregon 16000
        Texas 71000
        dtype: float64
      5. Detecting missing data

        • The pandas functions isnull and notnull detect missing data.
        • Series provides the same as instance methods, e.g. obj4.isnull().

        >>>pd.isnull(obj4) 
        California True
        Ohio False
        Oregon False
        Texas False
        dtype: bool
        >>>pd.notnull(obj4)
        California False
        Ohio True
        Oregon True
        Texas True
        dtype: bool

      6. Automatic index alignment

        • In arithmetic operations, a Series automatically aligns data by index label.

        >>>obj3
        Ohio 35000
        Oregon 16000
        Texas 71000
        Utah 5000
        dtype: int64
        >>>obj4
        California NaN
        Ohio 35000
        Oregon 16000
        Texas 71000
        dtype: float64
        >>>obj3 + obj4
        California NaN
        Ohio 70000
        Oregon 32000
        Texas 142000
        Utah NaN
        dtype: float64
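Labels that appear in only one operand give NaN with the + operator. If that is not wanted, the add method takes a fill_value that stands in for a label missing on one side. The sketch below rebuilds obj3 and obj4 so that it runs on its own.

```python
import pandas as pd

sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
obj3 = pd.Series(sdata)
obj4 = pd.Series(sdata, index=["California", "Ohio", "Oregon", "Texas"])

# Plain addition: 'Utah' exists only in obj3, so the result is NaN.
plain_sum = obj3 + obj4

# With fill_value=0 a label missing on one side counts as 0.
# 'California' is missing in obj3 and NaN in obj4, so it stays NaN.
filled_sum = obj3.add(obj4, fill_value=0)

print(plain_sum)
print(filled_sum)
```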
      7. The name attribute of a Series and of its index

        >>>obj4.name = 'population'

        >>>obj4.index.name = 'state'

        >>>obj4
        state
        California NaN
        Ohio 35000
        Oregon 16000
        Texas 71000
        Name: population, dtype: float64

      8. Changing the index

        >>>obj
        0 4
        1 7
        2 -5
        3 3

        >>>obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']

        >>>obj
        Bob 4
        Steve 7
        Jeff -5
        Ryan 3
        dtype: int64
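    2. DataFrame:

      A two-dimensional tabular data structure with an ordered set of columns. Each column can hold a different value type (numeric, string, boolean, etc.) and has a label; a DataFrame can be seen as a dict of Series.

      1. Creating a DataFrame (1)

        • A DataFrame has both a row index and a column index; it can be viewed as a dict of Series that share the same index.
        • The most common way to create one is to pass a dict of equal-length lists or NumPy arrays.

        >>>data={'state':['Ohio','Ohio','Ohio','Nevada','Nevada'],
        ...      'year':[2000, 2001, 2002, 2001, 2002],
        ...      'pop':[1.5, 1.7, 3.6, 2.4, 2.9]}
        >>>frame = DataFrame(data)
        >>>frame       # recent pandas keeps the dict's key order; older versions sorted the columns
        state year pop
        0 Ohio 2000 1.5
        1 Ohio 2001 1.7
        2 Ohio 2002 3.6
        3 Nevada 2001 2.4
        4 Nevada 2002 2.9

        • If a sequence of columns is given, the columns of the DataFrame are arranged in that order.

        >>>DataFrame(data, columns=['year', 'state', 'pop'])

        • As with a Series, a requested column that is not found in the data is filled with NaN.

        >>>frame2=DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
        ...                 index=['one', 'two', 'three', 'four', 'five'])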
        >>>frame2
        year state pop debt
        one 2000 Ohio 1.5 NaN
        two 2001 Ohio 1.7 NaN
        three 2002 Ohio 3.6 NaN
        four 2001 Nevada 2.4 NaN
        five 2002 Nevada 2.9 NaN
      2. Accessing columns

        • A column of a DataFrame can be retrieved as a Series, either with dict-style notation or as an attribute:

        >>>frame2['state']
        one Ohio
        two Ohio
        three Ohio
        four Nevada
        five Nevada
        Name: state, dtype: object
        >>>frame2.year
        one 2000
        two 2001
        three 2002
        four 2001
        five 2002
        Name: year, dtype: int64
      3. Accessing rows

        • Rows can be retrieved by label or by position, using loc (label) and iloc (position):

        >>>frame2.loc['three']
        year 2002
        state Ohio
        pop 3.6
        debt NaN
        Name: three, dtype: object

        >>> frame2.iloc[2]
        year 2002
        state Ohio
        pop 3.6
        debt NaN
        Name: three, dtype: object
      4. Modifying columns

        Assigning a scalar or an array sets the whole column. Assigning a Series aligns it on the index and leaves NaN where no label matches.

        >>>frame2['debt'] = 16.5
        >>>frame2
        year state pop debt
        one 2000 Ohio 1.5 16.5
        two 2001 Ohio 1.7 16.5
        three 2002 Ohio 3.6 16.5
        four 2001 Nevada 2.4 16.5
        five 2002 Nevada 2.9 16.5

        >>>frame2['debt'] = np.arange(5)
        >>>frame2
        year state pop debt
        one 2000 Ohio 1.5 0
        two 2001 Ohio 1.7 1
        three 2002 Ohio 3.6 2
        four 2001 Nevada 2.4 3
        five 2002 Nevada 2.9 4
        >>>val = Series([-1.2, -1.5, -1.7], index=[ 'two', 'four', 'five'])
        >>>frame2['debt'] = val
        >>>frame2
        year state pop debt
        one 2000 Ohio 1.5 NaN
        two 2001 Ohio 1.7 -1.2
        three 2002 Ohio 3.6 NaN
        four 2001 Nevada 2.4 -1.5
        five 2002 Nevada 2.9 -1.7
      5. Adding and deleting columns

        >>>frame2['eastern'] = frame2.state == 'Ohio'
        >>>frame2
        year state pop debt eastern
        one 2000 Ohio 1.5 NaN True
        two 2001 Ohio 1.7 -1.2 True
        three 2002 Ohio 3.6 NaN True
        four 2001 Nevada 2.4 -1.5 False
        five 2002 Nevada 2.9 -1.7 False

        >>>del frame2['eastern']
        >>>frame2
        year state pop debt
        one 2000 Ohio 1.5 NaN
        two 2001 Ohio 1.7 -1.2
        three 2002 Ohio 3.6 NaN
        four 2001 Nevada 2.4 -1.5
        five 2002 Nevada 2.9 -1.7
      6. Creating a DataFrame (2)

        • When a nested dict (a dict of dicts) is passed, the outer keys are interpreted as the column index and the inner keys as the row index:

        >>>pop = {'Nevada': {2001: 2.4, 2002: 2.9},
        'Ohio': {2000: 1.5, 2001: 1.7, 2002:3.6}}
        >>>frame3 = DataFrame(pop)
        >>>frame3
        Nevada Ohio
        2000 NaN 1.5
        2001 2.4 1.7
        2002 2.9 3.6

        >>>frame3 = DataFrame(pop, index=[2001, 2002, 2003])
        >>>frame3
        Nevada Ohio
        2001 2.4 1.7
        2002 2.9 3.6
        2003 NaN NaN
      7. Handling missing data

        The examples use the last frame3 defined above (index 2001, 2002, 2003).

        • Drop every row that contains missing data:

        >>>frame3.dropna(how='any')
        Nevada Ohio
        2001 2.4 1.7
        2002 2.9 3.6

        • Fill in missing values:

        >>>frame3.fillna(value=5)
        Nevada Ohio
        2001 2.4 1.7
        2002 2.9 3.6
        2003 5.0 5.0

        • Test which values are missing (NaN):

        >>>pd.isna(frame3)
        Nevada Ohio
        2001 False False
        2002 False False
        2003 True True
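The three operations can be compared side by side on the first version of frame3 (index 2000 to 2002, with one missing Nevada value). Filling with the column mean is an additional option shown here as an assumption about what one might want, not something from the original notes.

```python
import pandas as pd

pop = {"Nevada": {2001: 2.4, 2002: 2.9},
       "Ohio": {2000: 1.5, 2001: 1.7, 2002: 3.6}}
frame3 = pd.DataFrame(pop).sort_index()   # rows 2000, 2001, 2002

missing_mask = frame3.isna()              # True where a value is missing
dropped = frame3.dropna(how="any")        # drops the 2000 row
filled_const = frame3.fillna(value=5)     # replaces NaN with a constant
filled_mean = frame3.fillna(frame3.mean())  # replaces NaN with the column mean

print(missing_mask)
print(dropped)
print(filled_mean)
```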
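      8. Inspecting data

        • View the first or last n rows of a DataFrame

        frame.head(n)
        frame.tail(n)

        • View the index, the columns, and the underlying NumPy data

        frame.index
        frame.columns
        frame.values

        • Show a quick statistical summary

        frame.describe()

        : computes statistics for each numeric column, including count, mean, standard deviation, and quantiles

        • Transpose the data

        frame.T

        • Sort by axis labels

        frame.sort_index(axis=1, ascending=False): axis=1 sorts by column labels (here in descending order); the values move along with their columns.

        • Sort by values

        frame.sort_values(by='x'): sorts the rows by column x in ascending order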

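      9. Selecting rows and columns

        • Selecting several rows or columns:

        frame[['state', 'pop']]: selects the 'state' and 'pop' columns
        frame[0:3]: selects the first three rows

        • loc selects data by label:

        frame2.loc['one']: selects the row labeled 'one'
        frame2.loc['one', 'pop']: selects row 'one', column 'pop'
        frame2.loc[:, ['state', 'pop']]: selects all rows, columns 'state' and 'pop'
        frame2.loc[['one', 'two'], ['state', 'pop']]: selects rows 'one' and 'two', columns 'state' and 'pop'

        • iloc selects data by position:

        frame2.iloc[1:2, 1:2]: row 1 and column 1 (the stop position 2 is excluded)
        frame.iloc[[0,2], [1,2]]: rows 0 and 2, columns 1 and 2

        • Selecting by condition:

        frame[frame.year>2001]: selects the rows whose year is greater than 2001
        frame[frame>2001]: keeps the values greater than 2001 and replaces the rest with NaN; this only works when every column can be compared with a number
        frame[frame['year'].isin([2000, 2002])]: selects all rows whose year is 2000 or 2002 (the year values are integers, so the list must contain integers, not strings)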
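The selection styles above can be tried together in one sketch. It rebuilds a frame2 without the debt column so that it runs on its own. One detail worth noticing: a label slice with loc includes its end label, while a position slice with iloc excludes its stop position.

```python
import pandas as pd

data = {"state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada"],
        "year": [2000, 2001, 2002, 2001, 2002],
        "pop": [1.5, 1.7, 3.6, 2.4, 2.9]}
frame2 = pd.DataFrame(data, index=["one", "two", "three", "four", "five"])

by_label = frame2.loc[["one", "two"], ["state", "pop"]]
by_position = frame2.iloc[[0, 2], [1, 2]]

label_slice = frame2.loc["one":"three"]   # end label included: 3 rows
position_slice = frame2.iloc[0:3]         # stop position excluded: 3 rows

recent = frame2[frame2["year"] > 2001]
selected_years = frame2[frame2["year"].isin([2000, 2002])]

print(by_label)
print(by_position)
print(recent)
```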

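      10. Related operations

        • Statistics:

        a.mean(): the mean of each column of DataFrame a; a.mean(1): the mean of each row of DataFrame a
        a['x'].value_counts(): counts how often each value occurs in column x

        • Applying a function to the data:

        a.apply(lambda x: x.max() - x.min()): returns, for each column, the difference between its maximum and minimum

        • String operations:

        a['gender1'].str.lower(): converts all letters in gender1 to lower case. Note that a DataFrame has no str attribute, only a Series does, so the gender1 column of a has to be selected first.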
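A small worked example of these operations follows. The DataFrame a and its columns x, y and gender1 are made-up data for illustration; the numeric columns are selected first so that mean and apply are not applied to strings.

```python
import pandas as pd

a = pd.DataFrame({"x": [1, 4, 4, 7],
                  "y": [10.0, 20.0, 30.0, 60.0],
                  "gender1": ["Male", "FEMALE", "female", "MALE"]})
numeric = a[["x", "y"]]

col_means = numeric.mean()            # one mean per column
row_means = numeric.mean(axis=1)      # one mean per row
counts = a["x"].value_counts()        # how often each value of x occurs
ranges = numeric.apply(lambda col: col.max() - col.min())
lowered = a["gender1"].str.lower()    # .str works on a Series, not a DataFrame

print(col_means)
print(counts)
print(ranges)
print(lowered.tolist())
```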

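      11. Reading and writing files

        • Writing a .csv file:

        frame3.to_csv('C:\\Users\\qiuyu\\frame3.csv')

        • Reading a .csv file:

        frame4 = pd.read_csv('C:\\Users\\qiuyu\\frame3.csv')               # the saved index comes back as an ordinary column
        frame4 = pd.read_csv('C:\\Users\\qiuyu\\frame3.csv', index_col=0)  # use the first column as the row index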
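The effect of index_col can be seen in a round trip through a temporary file. The temporary path is an assumption made so the sketch runs anywhere; the frame mirrors the second frame3 above.

```python
import os
import tempfile
import pandas as pd

frame3 = pd.DataFrame({"Nevada": [2.4, 2.9, None], "Ohio": [1.7, 3.6, None]},
                      index=[2001, 2002, 2003])

path = os.path.join(tempfile.mkdtemp(), "frame3.csv")
frame3.to_csv(path)                           # the index is written as the first column

without_index = pd.read_csv(path)             # index becomes an unnamed data column
with_index = pd.read_csv(path, index_col=0)   # first column restored as the index

print(without_index.columns.tolist())
print(with_index)
```

Missing values are written as empty fields and read back as NaN.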

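      +a ) Panel:

      A three-dimensional array, which can be thought of as a container of DataFrames. Panel has since been deprecated and removed from pandas (it is gone as of version 0.25).
      The term "panel data" comes from econometrics and is also the origin of the name pandas: pan(el)-da(ta)-s.
      https://panel.holoviz.org/reference/widgets/DataFrame.html (note: this link documents a widget of the separate HoloViz Panel library, not the pandas Panel class)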

4th. Scikit-learn:
https://scikit-learn.org/stable/


Resources Recommended by the Professor


•Python documentation: https://docs.python.org/3/
•Data Processing Using Python (Coursera): https://www.coursera.org/learn/hipython/
•Codecademy: https://www.codecademy.com/learn/learn-python
•Dataquest: https://www.dataquest.io/


First Exercises


  • Pandas usage examples:

  • Basic Python exercises

    '''
    1. You are given a time t, a dict with six string keys (year, month, day, hour, minute, second); each value is a string of digits,
    e.g. t = {'year': '2013', 'month': '9', 'day': '30', 'hour': '16', 'minute': '45', 'second': '2'}.
    Print it in the format XXXX-XX-XX XX:XX:XX. The example above should print: 2013-09-30 16:45:02.
    Submit your code as a .py or .ipynb file.
    '''
    t = {'year': '2013', 'month': '9', 'day': '30', 'hour': '16', 'minute': '45', 'second': '2'}

    # The values are strings, so zero-pad them with str.zfill rather than the integer format code 'd'.
    date_part = '-'.join([t['year'].zfill(4), t['month'].zfill(2), t['day'].zfill(2)])
    time_part = ':'.join(t[key].zfill(2) for key in ('hour', 'minute', 'second'))
    print(f'{date_part} {time_part}')



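    '''
    2. You are given a list L of integers. Print according to the following rules:
    if L is in ascending order, print "UP";
    if L is in descending order, print "DOWN";
    if L is unordered, print "WRONG".
    Submit your code as a .py or .ipynb file.
    '''
    # References: https://dojang.io/mod/page/view.php?id=2286 https://ghdwn0217.tistory.com/58
    #             https://vision-ai.tistory.com/24 https://itholic.github.io/python-reverse-reversed/
    A = input('Enter a list L of integers separated by spaces.\nIf L is ascending, "UP" is printed;\nif L is descending, "DOWN" is printed;\nif L is unordered, "WRONG" is printed.\n')
    B = [int(x) for x in A.split()]   # list(A) would split the text into single characters, so parse integers instead
    L = sorted(B)                     # ascending copy
    R = sorted(B, reverse=True)       # descending copy
    if B == L:
        print('UP')
    elif B == R:
        print('DOWN')
    else:
        print('WRONG')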



    '''
    3. Every computer on the Internet has an IP address. A valid IP has the form A.B.C.D, where A, B, C and D are integers in [0, 255].
    For simplicity, leading zeros such as 001 are not allowed in these four integers.
    Given a string s (containing no whitespace), decide whether s is a valid IP: print Yes if it is, otherwise print No.
    For example, s="202.115.32.24" prints Yes; s="a.11.11.11" prints No.
    Submit your code as a .py or .ipynb file.
    '''
    def is_valid_ip(s):
        """Return True if s has the form A.B.C.D with each part in [0, 255] and no leading zeros."""
        parts = s.split('.')
        if len(parts) != 4:
            return False
        for part in parts:
            if not (part.isascii() and part.isdigit()):   # rejects '', 'a', '-1', ...
                return False
            if len(part) > 1 and part[0] == '0':           # leading zeros such as 001
                return False
            if int(part) > 255:
                return False
        return True

    s = input('Please enter an IP address: ')
    print('Yes' if is_valid_ip(s) else 'No')

  • Copyright: Copyright is owned by the author. For commercial reprints, please contact the author for authorization. For non-commercial reprints, please indicate the source.
  • Copyrights © 2020 GwenChanEn