Python Basics for Machine Learning

2020-09-02

Machine Learning


Commonly used development environments

Jupyter notebook

https://jupyter.org/
Google Colab (GPU available)

https://colab.research.google.com/

math module

https://docs.python.org/3/library/math.html

Opening and editing files

.open

A file object is created by the built-in function open:

>>> f = open('test.txt') # default: read-only text mode

.read

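Reads the whole file (or at most N characters in text mode, N bytes in binary mode, if a size is given) and returns it as a single string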

.readline

Reads one line, then moves on to the next one

>>> line = f.readline()       # read one line
>>> line
'This is the first line.\n'

.readlines

Reads all lines and returns them as a list of strings

>>> lines = f.readlines()     # read all remaining lines
>>> lines
['This is the second.\n', 'And third.\n']

.write

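The write function simply outputs the given string; it does not add a newline.
The data does not have to be ASCII text: binary data can be written too, provided the file is opened in binary mode ('wb') and a bytes object is passed

>>> w = open('output.txt', 'w')         # write mode (text by default)
>>> w.write('stuff')                    # no newline is added automatically
>>> w.write('\n')
>>> w.write('more\n and even more\n')
>>> w.close()

.close

Closes the file and flushes any buffered data to disk. After the calls above, output.txt contains:

stuff
more
 and even more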

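Reading with a for loop

•Note: every line ends with a trailing newline character '\n'

•Use the string method strip or rstrip to remove it

infile = open('test.txt')               # read-only mode
outfile = open('test_upper.txt', 'w')   # write mode; creates the file

for line in infile:
    outfile.write(line.upper())

infile.close()                   # not strictly required, since open files are closed when the program exits, but closing explicitly is safer
outfile.close()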
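The loop above can also be written with a with statement, which closes both files automatically even if an error occurs. This is a minimal sketch rather than part of the original notes; the temporary directory and the sample file contents are assumptions made so that the example runs on its own.

```python
import os
import tempfile

# Hypothetical file locations inside a temporary directory.
workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, "test.txt")
dst_path = os.path.join(workdir, "test_upper.txt")

# Create a small input file so the example is self-contained.
with open(src_path, "w") as f:
    f.write("This is the first line.\nThis is the second.\n")

# Both files are closed automatically when the with-block ends.
with open(src_path) as infile, open(dst_path, "w") as outfile:
    for line in infile:
        # rstrip removes the trailing newline; it is added back explicitly.
        outfile.write(line.rstrip("\n").upper() + "\n")

with open(dst_path) as f:
    result = f.read()
print(result)
```

After the block, infile.closed and outfile.closed are both True, so no explicit close() call is needed.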

NumPy: .savetxt()

The savetxt() function saves an array to a text file:

>>> import numpy as np
>>> a = np.linspace(0, 1, 12); a.shape = (3, 4); a
array([[ 0.        ,  0.09090909,  0.18181818,  0.27272727],
       [ 0.36363636,  0.45454545,  0.54545455,  0.63636364],
       [ 0.72727273,  0.81818182,  0.90909091,  1.        ]])
>>> np.savetxt("myfile.txt", a)

NumPy: .save()

The save() function stores an array in a binary file in NumPy's ".npy" format:

>>> np.save("myfile", a)

This creates a binary file myfile.npy containing the array a, which can later be read back into memory with np.load()

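NumPy: .loadtxt()

The loadtxt() function reads an array stored in a text file into memory.
By default it assumes that columns are separated by whitespace; this can be changed through optional parameters such as delimiter. Lines starting with # are treated as comments and skipped.

Example text file data.txt:

    # Year       Min temp.    Max temp.
       1990           -1.5         25.3
       1991           -3.2         21.2

>>> table = np.loadtxt("data.txt")
>>> table
array([[1.99000000e+03, -1.50000000e+00, 2.53000000e+01],
       [1.99100000e+03, -3.20000000e+00, 2.12000000e+01]])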
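To see the text and binary formats side by side, the following sketch saves the same array both ways and loads it back. The temporary directory and the comma delimiter are assumptions chosen for illustration; they are not taken from the notes above.

```python
import os
import tempfile

import numpy as np

workdir = tempfile.mkdtemp()                      # hypothetical scratch directory
txt_path = os.path.join(workdir, "myfile.txt")
npy_path = os.path.join(workdir, "myfile.npy")

a = np.linspace(0, 1, 12).reshape(3, 4)

# Text round trip: human-readable, here comma-separated instead of whitespace.
np.savetxt(txt_path, a, delimiter=",")
from_txt = np.loadtxt(txt_path, delimiter=",")

# Binary round trip: compact, and it preserves dtype and shape.
np.save(npy_path, a)
from_npy = np.load(npy_path)

print(from_txt.shape, from_npy.dtype)
```

The delimiter passed to loadtxt has to match the one used by savetxt; the .npy format needs no such bookkeeping.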

Main packages: NumPy, Matplotlib, Pandas, ...

1st. NumPy (Numerical Python Extensions):

Importing

import numpy as np

Brief introduction

>>> np.array([2,3,6,7])
array([2, 3, 6, 7])
>>> np.array([2,3,6,7.])
array([2., 3., 6., 7.])
>>> np.array([2,3,6,7+1j])
array([2.+0.j, 3.+0.j, 6.+0.j, 7.+1.j])

Often, the elements of an array are originally unknown, but its size is known. Hence, NumPy offers several functions to create arrays with initial placeholder content. These minimize the necessity of growing arrays, an expensive operation.

NumPy’s array class is called ndarray. It is also known by the alias array. Note that numpy.array is not the same as the Standard Python Library class array.array, which only handles one-dimensional arrays and offers less functionality. The more important attributes of an ndarray object are:

ndarray.ndim

the number of axes (dimensions) of the array.

ndarray.shape

the dimensions of the array. This is a tuple of integers indicating the size of the array in each dimension. For a matrix with n rows and m columns, shape will be (n,m). The length of the shape tuple is therefore the number of axes, ndim.

ndarray.size

the total number of elements of the array. This is equal to the product of the elements of shape.

ndarray.dtype

an object describing the type of the elements in the array. One can create or specify dtypes using standard Python types. Additionally NumPy provides types of its own. numpy.int32, numpy.int16, and numpy.float64 are some examples.

ndarray.itemsize

the size in bytes of each element of the array. For example, an array of elements of type float64 has itemsize 8 (=64/8), while one of type int32 has itemsize 4 (=32/8). It is equivalent to ndarray.dtype.itemsize.

ndarray.data

the buffer containing the actual elements of the array. Normally, we won't need to use this attribute because we will access the elements in an array using indexing facilities.

>>> import numpy as np
>>> a = np.arange(15).reshape(3, 5)
>>> a
array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])
>>> a.shape           # number of rows and columns
(3, 5)
>>> a.ndim            # number of dimensions
2
>>> a.dtype.name      # platform dependent; may be 'int32' on some systems
'int64'
>>> a.itemsize
8
>>> a.size            # total number of elements
15
>>> type(a)
<class 'numpy.ndarray'>
>>> b = np.array([6, 7, 8])
>>> b
array([6, 7, 8])
>>> type(b)
<class 'numpy.ndarray'>

A frequent error

A frequent error consists in calling `array` with multiple arguments, rather than providing a single sequence as an argument.

>>> a = np.array(1,2,3,4)    # WRONG
Traceback (most recent call last):
  ...
TypeError: array() takes from 1 to 2 positional arguments but 4 were given
>>> a = np.array([1,2,3,4])  # RIGHT

Notes from the lecture slides

arange([start,] stop[, step,], dtype=None)

>>> np.arange(5)
array([0, 1, 2, 3, 4])
>>> np.arange(10, 100, 20, dtype=float)
array([10., 30., 50., 70., 90.])

# Transforming
a = np.arange(0, 20, 1)    # 1-D array
b = a.reshape((4, 5))      # 4 rows, 5 columns
c = a.reshape((20, 1))     # 2-D: 20 rows, 1 column
d = a.reshape((-1, 4))     # -1: the number of rows is inferred automatically
a.shape = (4, 5)           # change the shape of a in place

.linspace(start, stop, num=50, endpoint=True, retstep=False, dtype=None)

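>>> np.linspace(0., 2.5, 5)
array([0.   , 0.625, 1.25 , 1.875, 2.5  ])

>>> from numpy import pi
>>> x = np.linspace(0, 2*pi, 100)  # useful for evaluating a function at many points
>>> f = np.sin(x)

The shapes (N,), (N, 1) and (1, N) are different

Shape (N,): the array is one-dimensional
Shape (N, 1): the array is two-dimensional, with N rows and one column
Shape (1, N): the array is two-dimensional, with one row and N columns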

import numpy as np

a = np.array([1, 2, 3, 4, 5])  # 1-D array
b = a.copy()

c1 = np.dot(np.transpose(a), b)  # transposing has no effect on a 1-D array
print(c1)
c2 = np.dot(a, np.transpose(b))   # the transpose can also be written b.T
print(c2)

ax = np.reshape(a, (5, 1))
bx = np.reshape(b, (1, 5))
c = np.dot(ax, bx)
print(c)
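The difference between the three shapes shows up in the result of np.dot. The sketch below is an illustration added here, with variable names chosen for this example only; it checks the shape of each product.

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5])      # shape (5,)
col = a.reshape(5, 1)              # shape (5, 1): column vector
row = a.reshape(1, 5)              # shape (1, 5): row vector

inner = np.dot(a, a.T)             # a.T still has shape (5,), so this is a scalar
outer = np.dot(col, row)           # (5, 1) x (1, 5) -> (5, 5) matrix
inner_2d = np.dot(row, col)        # (1, 5) x (5, 1) -> (1, 1) matrix

print(a.T.shape, inner, outer.shape, inner_2d.shape)
```

For a 1-D array the dot product is always the inner product; to get the outer product the operands must be reshaped to two dimensions first.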

Filling arrays with a constant value or with random numbers

.zeros
.ones

.full

>>>np.zeros(3)
array([0., 0., 0.])

>>>np.zeros((2, 2), complex)
array([[0.+0.j, 0.+0.j],
       [0.+0.j, 0.+0.j]])

>>>np.ones((2, 3))
array([[1., 1., 1.],
       [1., 1., 1.]])

>>>np.full((2,2), 7)
array([[7, 7],
       [7, 7]])

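.rand: uniformly distributed random numbers in the half-open interval [0, 1)
.randn: random numbers from the standard normal (Gaussian) distribution, with mean 0 and variance 1

etc. (random numbers from other standard probability distributions are available as well)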

np.random.rand(2, 4)
array([[ 0.94672374,  0.0383632 ,  0.12738539,  0.21592466],
       [ 0.49394559,  0.2216863 ,  0.3053351 ,  0.51381235]])

np.random.randn(2, 4)
array([[ 1.05383548, -1.2142876 , -0.83458293,  0.53291161],
       [ 0.08311765,  0.14007751, -0.06647882,  1.09115942]])
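Random output differs on every run, which makes results hard to reproduce. One option, sketched here as a suggestion rather than as part of the original notes, is NumPy's newer Generator interface created with np.random.default_rng; the seed value 0 is an arbitrary assumption.

```python
import numpy as np

rng = np.random.default_rng(seed=0)   # seed chosen arbitrarily for this sketch

u = rng.random((2, 4))                # uniform on [0, 1), like np.random.rand(2, 4)
z = rng.standard_normal((2, 4))       # standard normal, like np.random.randn(2, 4)

# Re-creating the generator with the same seed reproduces the same numbers.
u_again = np.random.default_rng(seed=0).random((2, 4))

print(u.shape, z.shape, np.array_equal(u, u_again))
```

Note that the Generator methods take the shape as a single tuple, whereas rand and randn take each dimension as a separate argument.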

Indexing and slicing 1-D arrays:
The index form [start:stop] extracts a segment of an array (from position start up to, but not including, position stop)

>>>a = np.array([0, 1, 2, 3, 4])

>>>a[1:3]
array([1, 2])

>>>a[:3]
array([0, 1, 2])

>>>a[1:]
array([1, 2, 3, 4])

>>>a[1:-1]
array([1, 2, 3])

The whole array: a or a[:]

>>>a = np.array([0, 1, 2, 3, 4])
>>>a[:]
array([0, 1, 2, 3, 4])

To take every n-th element, give a third number (the step) after the second colon:

>>>a[::2]
array([0, 2, 4])
>>>a[1:4:2]
array([1, 3])

A step of -1 reverses the array:

>>>a[::-1]
array([4, 3, 2, 1, 0])

Indexing 2-D arrays

The index of a multidimensional array is a tuple of integers:

a = np.arange(12); a.shape = (3, 4); a
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

>>>a[1, 2]
6
>>>a[1, -1]
7

Slicing 2-D arrays: single rows and single columns, similar to lists

a = np.arange(12); a.shape = (3, 4); a
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
>>>a[:, 1]
array([1, 5, 9])
>>>a[2, :]
array([8, 9, 10, 11])
>>>a[1][2]
6
>>>a[2]
array([8, 9, 10, 11])

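Copies and views:

A slice of a standard Python list is a copy of it.
A slice of a NumPy array is a view on the array. The slice and the original array refer to the same block of memory, so changing the contents of the view also changes the original array:
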
>>>a = np.arange(5); a
array([0, 1, 2, 3, 4])
>>>b = a[2:]; b
array([2, 3, 4])
>>>b[0] = 100
>>>b
array([100, 3, 4])
>>>a
array([0, 1, 100, 3, 4])

To avoid changing the original array, copy the slice:

>>>a = np.arange(5); a
array([0,  1,  2,  3, 4])
>>>b = a[2:].copy(); b
array([2, 3, 4])
>>>b[0] = 100
>>>b
array([100, 3, 4])
>>>a
array([0, 1, 2, 3, 4])
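Whether two arrays share memory can also be checked directly instead of by modifying them. The following sketch uses np.shares_memory and the base attribute; the variable names view and copy are chosen for this illustration.

```python
import numpy as np

a = np.arange(5)
view = a[2:]            # basic slicing returns a view
copy = a[2:].copy()     # an explicit copy owns its own data

print(np.shares_memory(a, view))          # the view overlaps a in memory
print(np.shares_memory(a, copy))          # the copy does not
print(view.base is a, copy.base is None)  # base points to the array a view was taken from

view[0] = 100           # writes through to a
copy[1] = -1            # leaves a untouched
print(a)
```

Only basic slices are views; the check above is a quick way to find out what a particular indexing expression returned.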

Array arithmetic:

The basic arithmetic operations act element-wise on arrays

import numpy as np

x = np.array([[1,2],[3,4]], dtype=np.float64)
y = np.array([[5,6],[7,8]], dtype=np.float64)

print(x + y)
print(np.add(x, y))

print(x - y)
print(np.subtract(x, y))

print(x * y)
print(np.multiply(x, y))

print(x / y)
print(np.divide(x, y))

print(np.sqrt(x))

Matrix multiplication

Matrix multiplication is done with the dot function:

>>>A = np.array([[1, 2], [3, 4]])
>>>np.dot(A, A)
array([[7, 10],
       [15, 22]])

The dot function also handles products of a matrix and a vector:

>>>A
array([[1, 2],
       [3, 4]])
>>>x = np.array([10, 20])
>>>np.dot(A, x)            # equivalent to A.dot(x)
array([ 50, 110])
>>>np.dot(x, A)            # equivalent to x.dot(A)
array([ 70, 100])
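Because * is element-wise, it is easy to confuse with the matrix product. As a short sketch (an addition to the notes, not taken from them), the @ operator gives the same result as np.dot for the 1-D and 2-D arrays used here:

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
x = np.array([10, 20])

elementwise = A * A     # multiplies matching entries
matmul = A @ A          # matrix product, same as np.dot(A, A) for 2-D arrays

print(elementwise)
print(matmul)
print(A @ x, x @ A)
```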

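More efficient math functions

NumPy includes many common mathematical functions, for example:
np.log, np.maximum, np.sin, np.exp, np.abs and so on (see https://docs.scipy.org/doc/numpy/reference/routines.math.html)
When applied to whole arrays, NumPy functions are usually much faster than looping over the elements with the corresponding functions of the math module, especially for large data sets
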
x = np.array([[1,2],[3,4]])

print(np.sum(x)) # Compute sum of all elements;
print(np.sum(x, axis=0)) # Compute sum of each column;
print(np.sum(x, axis=1)) # Compute sum of each row;
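The axis argument names the axis that is collapsed, which is easy to get backwards. The sketch below repeats the sums with named results and adds keepdims=True, an option not mentioned in the notes, to show one common use: normalizing each row.

```python
import numpy as np

x = np.array([[1, 2], [3, 4]])

total = np.sum(x)                 # all elements
col_sums = np.sum(x, axis=0)      # collapse the rows: one value per column
row_sums = np.sum(x, axis=1)      # collapse the columns: one value per row

# keepdims=True keeps the reduced axis with length 1, so the result
# broadcasts against x and every row can be divided by its own sum.
row_share = x / np.sum(x, axis=1, keepdims=True)

print(total, col_sums, row_sums)
print(row_share)
```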

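2nd. Matplotlib:

Matplotlib is one of the most widely used visualization tools in Python. It makes it easy to create a very wide range of 2D charts and some basic 3D charts.
Its function design was modeled on MATLAB, hence the name Matplotlib.
It was originally developed to visualize electrocorticography (ECoG) signals from epilepsy patients; its original author, Dr. John D. Hunter, was a neurobiologist. The first release dates from 2003, and the paper describing it was published in 2007.

The simplest chart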

import matplotlib.pyplot as plt

plt.plot([1,2,3,4], [1,4,9,16], 'ro')
plt.axis([0, 6, 0, 20])
plt.show()

Several functions in one chart (1)

import numpy as np
import matplotlib.pyplot as plt

t = np.arange(0., 5., 0.2)
plt.plot(t, t, 'r--', t, t**2, 'bs', t, t**3, 'g^')
plt.show()

Setting line properties

# x, y, x1, y1, x2 and y2 below stand for arrays of data to plot

# Using keyword arguments:
plt.plot(x, y, linewidth=2.0)

# Using the setter methods of the Line2D object:
line, = plt.plot(x, y, '-')
line.set_antialiased(False) # turn off antialiasing

# Using the setp() command:
lines = plt.plot(x1, y1, x2, y2)
# use keyword args
plt.setp(lines, color='r', linewidth=2.0)
# or MATLAB style string value pairs
plt.setp(lines, 'color', 'r', 'linewidth', 2.0)

Several functions in one chart (2)

import numpy as np
import matplotlib.pyplot as plt

# Compute the x and y coordinates for points on sine and cosine curves
x = np.arange(0, 3 * np.pi, 0.1)
y_sin = np.sin(x)
y_cos = np.cos(x)

# Plot the points using matplotlib
plt.plot(x, y_sin)
plt.plot(x, y_cos)
plt.xlabel('x axis label')
plt.ylabel('y axis label')
plt.title('Sine and Cosine')
plt.legend(['Sine', 'Cosine'])
plt.show()

Adding text

import numpy as np
import matplotlib.pyplot as plt

# Fixing random state for reproducibility
np.random.seed(19680801)

mu, sigma = 100, 15
x = mu + sigma * np.random.randn(10000)

# the histogram of the data
n, bins, patches = plt.hist(x, 50, density=True, facecolor='g', alpha=0.75)  # density=True replaces the removed normed argument

plt.xlabel('Smarts')
plt.ylabel('Probability')
plt.title('Histogram of IQ')
plt.text(60, .025, r'$\mu=100,\ \sigma=15$')
plt.axis([40, 160, 0, 0.03])
plt.grid(True)
plt.show()

Adding annotations

import numpy as np
import matplotlib.pyplot as plt

ax = plt.subplot(111)

t = np.arange(0.0, 5.0, 0.01)
s = np.cos(2*np.pi*t)
line, = plt.plot(t, s, lw=2)

plt.annotate('local max', xy=(2, 1), xytext=(3, 1.5),
            arrowprops=dict(facecolor='black', shrink=0.05),
            )

plt.ylim(-2,2)
plt.show()

Multiple charts: subplots

import numpy as np
import matplotlib.pyplot as plt

def f(t):
    return np.exp(-t) * np.cos(2*np.pi*t)

t1 = np.arange(0.0, 5.0, 0.1)
t2 = np.arange(0.0, 5.0, 0.02)

plt.figure(1)
plt.subplot(211) # 2 rows, 1 column, index = 1
plt.plot(t1, f(t1), 'bo', t2, f(t2), 'k')

plt.subplot(212) # 2 rows, 1 column, index = 2
plt.plot(t2, np.cos(2*np.pi*t2), 'r--')
plt.show()
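The same layout can be built with Matplotlib's object-oriented interface, where plt.subplots returns the figure and its axes so that each subplot is addressed by name. This is a sketch under a few assumptions: the Agg backend and the in-memory PNG buffer are used only so that the example runs without opening a window, and the titles are made up.

```python
import io

import matplotlib
matplotlib.use("Agg")              # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
import numpy as np

t = np.arange(0.0, 5.0, 0.02)

# plt.subplots creates the figure and a 2 x 1 grid of axes in one call.
fig, (ax_top, ax_bottom) = plt.subplots(2, 1, sharex=True)
ax_top.plot(t, np.exp(-t) * np.cos(2 * np.pi * t), "k")
ax_top.set_title("damped cosine")
ax_bottom.plot(t, np.cos(2 * np.pi * t), "r--")
ax_bottom.set_xlabel("t")

buffer = io.BytesIO()              # in-memory target instead of a file on disk
fig.savefig(buffer, format="png")
plt.close(fig)
```

In an interactive session or a notebook, plt.show() would be called instead of saving to a buffer.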

Displaying an image

import matplotlib.pyplot as plt

plt.figure('A Little White Dog')
little_dog_img = plt.imread('little_white_dog.jpg')
plt.imshow(little_dog_img)
plt.show()

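3rd. Pandas:

Pandas is a data analysis package for Python. Development began at AQR Capital Management in April 2008, and it was open-sourced at the end of 2009

Import convention:

Series and DataFrame are used so often that it is convenient to import them into the local namespace

>>>from pandas import Series, DataFrame
>>>import pandas as pd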

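Common data structures

Series:

A one-dimensional labeled array, made up of a sequence of data (of any NumPy data type) and an associated sequence of data labels (the index).
It resembles a one-dimensional NumPy array or a Python list; unlike a list, an array or a Series holds elements of a single type

Creating a Series: passing a list

• Default integer index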

obj = Series([4, 7, -5, 3])
>>>obj
0    4
1    7
2   -5
3    3
dtype: int64

>>>obj.values
array([ 4,  7, -5,  3], dtype=int64)

>>>obj.index
RangeIndex(start=0, stop=4, step=1)

• Explicit index

obj2 = Series([4,7,-5,3], index=['d','b','a','c'])
>>>obj2
d    4
b    7
a   -5
c    3
dtype: int64

>>> obj2.index
Index(['d', 'b', 'a', 'c'], dtype='object')

Accessing elements of a Series

: Index labels can be used to select a single value or a set of values from a Series

>>>obj2['a']
-5

>>>obj2['d']= 6

>>>obj2[['c','a','d']]
c    3
a   -5
d    6
dtype: int64

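Operations on a Series

: NumPy array operations, such as filtering with a boolean array, scalar multiplication, or applying math functions, preserve the link between index and values.
A Series can also be viewed as a fixed-length ordered dictionary, since it maps index values to data values. It can be used in many functions that expect a dictionary argument.

>>>obj2[obj2 > 0]
d    6
b    7
c    3
dtype: int64
>>>obj2*2
d    12
b    14
a   -10
c     6
dtype: int64
>>>np.exp(obj2)
d     403.428793
b    1096.633158
a       0.006738
c      20.085537
dtype: float64

>>> 'b' in obj2
True
>>> 'e' in obj2
False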

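Creating a Series: passing a dictionary

•If the data is stored in a Python dictionary, a Series can be created directly from that dictionary
•If only a dictionary is passed, the index of the resulting Series consists of the dictionary's keys. Older pandas versions sorted the keys, as in the output below; pandas 0.23 and later on Python 3.6+ keep the dictionary's insertion order instead

>>>sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
>>>obj3 = Series(sdata)
>>>obj3
Ohio      35000
Oregon    16000
Texas     71000
Utah       5000
dtype: int64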
•In the next example, the 3 values in sdata that match the states index are found and placed in the corresponding positions; since no sdata value exists for "California", its result is NaN (Not a Number)

>>>states = ['California', 'Ohio', 'Oregon', 'Texas']
>>>obj4 = Series(sdata, index=states)
>>>obj4
California      NaN
Ohio          35000
Oregon        16000
Texas         71000
dtype: float64

Detecting missing data

•The pandas functions isnull and notnull detect missing data
•Series also offers them as instance methods, e.g. obj4.isnull()

>>>pd.isnull(obj4) 
California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool
>>>pd.notnull(obj4)
California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool

Automatic index alignment

•In arithmetic operations, a Series automatically aligns data by index label

>>>obj3
Ohio      35000
Oregon    16000
Texas     71000
Utah       5000
dtype: int64
>>>obj4
California      NaN
Ohio          35000
Oregon        16000
Texas         71000
dtype: float64
>>>obj3 + obj4
California       NaN
Ohio           70000
Oregon         32000
Texas         142000
Utah             NaN
dtype: float64
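With plain +, any label that is missing on either side produces NaN. If that is not wanted, the add method with fill_value is one alternative; this sketch is an addition to the notes and reuses the same data.

```python
import pandas as pd

sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
obj3 = pd.Series(sdata)
obj4 = pd.Series(sdata, index=["California", "Ohio", "Oregon", "Texas"])

plain = obj3 + obj4                     # NaN wherever either side is missing
filled = obj3.add(obj4, fill_value=0)   # a value missing on only one side counts as 0

print(plain)
print(filled)
```

Utah exists only in obj3, so it keeps its value in filled. California is missing on both sides (absent from obj3, NaN in obj4), so it stays NaN even with fill_value.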

The name attribute of a Series and of its index

>>>obj4.name = 'population'

>>>obj4.index.name = 'state'

>>>obj4
state
California      NaN
Ohio          35000
Oregon        16000
Texas         71000
Name: population, dtype: float64

Modifying the index

>>>obj
0    4
1    7
2   -5
3    3

>>>obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']

>>>obj
Bob      4
Steve    7
Jeff    -5
Ryan     3
dtype: int64

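DataFrame:

A two-dimensional tabular data structure containing an ordered collection of columns, each of which can hold a different value type (numeric, string, boolean, etc.). Every column has a label, so a DataFrame can be seen as a dictionary of Series

Creating a DataFrame (1)

•A DataFrame has both a row index and a column index; it can be thought of as a dictionary of Series that all share the same index.
•The most common way to create one is to pass a dictionary of equal-length lists or NumPy arrays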

>>>data={'state':['Ohio','Ohio','Ohio','Nevada','Nevada'],
      'year':[2000, 2001, 2002, 2001, 2002],
      'pop':[1.5, 1.7, 3.6, 2.4, 2.9]}
>>>frame = DataFrame(data)
>>>frame
   pop   state  year
0  1.5    Ohio  2000
1  1.7    Ohio  2001
2  3.6    Ohio  2002
3  2.4  Nevada  2001
4  2.9  Nevada  2002

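•If a sequence of columns is specified, the DataFrame's columns are arranged in that order

>>>DataFrame(data, columns=['year', 'state', 'pop'])

•As with Series, a requested column that is not found in the data is filled with NaN values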

>>>frame2=DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
            index=['one', 'two', 'three', 'four', 'five'])
>>>frame2
       year   state  pop debt
one    2000    Ohio  1.5  NaN
two    2001    Ohio  1.7  NaN
three  2002    Ohio  3.6  NaN
four   2001  Nevada  2.4  NaN
five   2002  Nevada  2.9  NaN

Accessing columns

•A column of a DataFrame can be retrieved as a Series, either with dictionary-style notation or as an attribute:

>>>frame2['state']
one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
Name: state, dtype: object
>>>frame2.year
one       2000
two      2001
three     2002
four      2001
five      2002
Name: year, dtype: int64

Accessing rows

•Rows can be retrieved by label or by position, using loc (label) and iloc (position)

>>>frame2.loc['three']
year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object

>>> frame2.iloc[2]
year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object

Modifying columns

>>>frame2['debt'] = 16.5
>>>frame2
       year   state  pop  debt
one    2000    Ohio  1.5  16.5
two    2001    Ohio  1.7  16.5
three  2002    Ohio  3.6  16.5
four   2001  Nevada  2.4  16.5
five   2002  Nevada  2.9  16.5

>>>frame2['debt'] = np.arange(5)
>>>frame2
       year   state  pop  debt
one    2000    Ohio  1.5     0
two    2001    Ohio  1.7     1
three  2002    Ohio  3.6     2
four   2001  Nevada  2.4     3
five   2002  Nevada  2.9     4
>>>val = Series([-1.2, -1.5, -1.7], index=[ 'two', 'four', 'five'])
>>>frame2['debt'] = val
>>>frame2
       year   state  pop  debt
one    2000    Ohio  1.5   NaN
two    2001    Ohio  1.7  -1.2
three  2002    Ohio  3.6   NaN
four   2001  Nevada  2.4  -1.5
five   2002  Nevada  2.9  -1.7

Adding and deleting columns

>>>frame2['eastern'] = frame2.state == 'Ohio'
>>>frame2
       year   state  pop  debt  eastern
one    2000    Ohio  1.5   NaN     True
two    2001    Ohio  1.7  -1.2     True
three  2002    Ohio  3.6   NaN     True
four   2001  Nevada  2.4  -1.5    False
five   2002  Nevada  2.9  -1.7    False

>>>del frame2['eastern']
>>>frame2
       year   state  pop  debt
one    2000    Ohio  1.5   NaN
two    2001    Ohio  1.7  -1.2
three  2002    Ohio  3.6   NaN
four   2001  Nevada  2.4  -1.5
five   2002  Nevada  2.9  -1.7

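Creating a DataFrame (2)

•When a nested dictionary (a dictionary of dictionaries) is passed, the outer keys are interpreted as column labels and the inner keys as row labels: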

>>>pop = {'Nevada': {2001: 2.4, 2002: 2.9},
        'Ohio': {2000: 1.5, 2001: 1.7, 2002:3.6}}
>>>frame3 = DataFrame(pop)
>>>frame3 
       Nevada  Ohio
2000     NaN   1.5
2001     2.4   1.7
2002     2.9   3.6

>>>frame3 = DataFrame(pop, index=[2001, 2002, 2003])
>>>frame3 
       Nevada  Ohio
2001     2.4   1.7
2002     2.9   3.6
2003     NaN   NaN

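Handling missing data

The examples below use the first version of frame3, DataFrame(pop), whose index is 2000, 2001, 2002.

•Drop every row that contains missing data:

>>>frame3.dropna(how='any')
       Nevada  Ohio
2001     2.4   1.7
2002     2.9   3.6

•Fill in missing values:

>>>frame3.fillna(value=5)
       Nevada  Ohio
2000     5.0   1.5
2001     2.4   1.7
2002     2.9   3.6

•Test which values are missing (NaN):

>>>pd.isna(frame3)
       Nevada  Ohio
2000    True   False
2001    False  False
2002    False  False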
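The three operations return new objects and leave the original frame unchanged. The sketch below puts them together on the same population data; filling with the column mean is an assumption made for illustration, not something the notes prescribe.

```python
import pandas as pd

pop = {"Nevada": {2001: 2.4, 2002: 2.9},
       "Ohio": {2000: 1.5, 2001: 1.7, 2002: 3.6}}
frame3 = pd.DataFrame(pop).sort_index()   # sort so the rows are 2000, 2001, 2002

mask = frame3.isna()                      # True where a value is missing
n_missing = int(mask.sum().sum())         # total number of missing cells

dropped = frame3.dropna(how="any")        # returns a new frame; frame3 is unchanged
filled = frame3.fillna(frame3.mean())     # fill each column with its own mean

print(n_missing)
print(dropped)
print(filled)
```

Passing a Series to fillna fills column by column, so the missing Nevada value is replaced by the Nevada mean only.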

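Viewing data

•View the first or last n rows of a DataFrame
frame.head(n)
frame.tail(n)

•View a DataFrame's index, columns and underlying NumPy data
frame.index
frame.columns
frame.values

•Show a quick statistical summary of the data
frame.describe()

:computes statistics for every numeric column, including the count, mean, standard deviation and quantiles

•Transpose the data
frame.T

•Sort along an axis
frame.sort_index(axis=1, ascending=False), where axis=1 sorts by column label; the data in each column moves with its label.

•Sort by values
frame.sort_values(by='x') sorts the rows by column x in ascending order

Selecting rows and columns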

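•Select several rows or columns:
frame[['state', 'pop']] selects the two columns 'state' and 'pop'
frame[0:3] selects the first three rows

•loc selects data by label:
frame2.loc['one'] selects the row whose index label is 'one'
frame2.loc['one', 'pop'] selects row 'one', column 'pop'
frame2.loc[:, ['state', 'pop']] selects all rows, columns 'state' and 'pop'
frame2.loc[['one', 'two'], ['state', 'pop']] selects rows 'one' and 'two', columns 'state' and 'pop'

•iloc selects data by position:
frame2.iloc[1:2, 1:2]
frame.iloc[[0,2], [1,2]]

•Select by condition:
frame[frame.year>2001] selects the rows whose year is greater than 2001
frame[frame>2001] keeps every value greater than 2001 and replaces the rest with NaN; this needs all columns to be comparable with a number, so it fails on a frame that has a string column such as state
frame[frame['year'].isin([2000, 2002])] selects all rows whose year is 2000 or 2002 (year holds integers, so the values are given as integers rather than strings)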
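The selection styles listed above can be compared on one small frame. This sketch rebuilds frame2 from the earlier data so that it runs on its own; the result names are chosen for this example.

```python
import pandas as pd

data = {"state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada"],
        "year": [2000, 2001, 2002, 2001, 2002],
        "pop": [1.5, 1.7, 3.6, 2.4, 2.9]}
frame2 = pd.DataFrame(data, index=["one", "two", "three", "four", "five"])

by_label = frame2.loc[["one", "two"], ["state", "pop"]]   # selection by label
by_position = frame2.iloc[1:3, 0:2]                      # by position, end-exclusive
recent = frame2[frame2["year"] > 2001]                   # boolean mask on one column
picked = frame2[frame2["year"].isin([2000, 2002])]       # membership test

print(by_label, by_position, recent, picked, sep="\n\n")
```

iloc slices exclude the end position like ordinary Python slices, whereas label slices with loc include the end label.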

Related operations

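•Statistics:
a.mean() computes the mean of every column of DataFrame a; a.mean(axis=1) computes the mean of every row instead
a['x'].value_counts() counts how often each value occurs in column x

•Applying a function to the data:
a.apply(lambda x: x.max() - x.min()) returns, for every column, the difference between its maximum and minimum

•String operations:
a['gender1'].str.lower() converts all letters in gender1 to lowercase. Note that a DataFrame has no str attribute, only a Series does, so the gender1 column of a must be selected first.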
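A small frame makes these operations concrete. The data below is hypothetical: the columns x, y and gender1 and their values are made up for this sketch.

```python
import pandas as pd

# Hypothetical data; column names and values are invented for illustration.
a = pd.DataFrame({"x": [1, 2, 2, 5],
                  "y": [10.0, 20.0, 40.0, 50.0],
                  "gender1": ["Male", "FEMALE", "female", "male"]})

numeric = a[["x", "y"]]                                      # restrict to numeric columns
col_means = numeric.mean()                                   # one mean per column
row_means = numeric.mean(axis=1)                             # one mean per row
ranges = numeric.apply(lambda col: col.max() - col.min())    # max - min per column
counts = a["x"].value_counts()                               # frequency of each value in x
lowered = a["gender1"].str.lower()                           # .str exists on a Series only

print(col_means, row_means, ranges, counts, lowered, sep="\n\n")
```

The numeric columns are selected first because mean and the max - min lambda are not meaningful for the string column.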

Reading and writing files

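•Write a .csv file:
frame3.to_csv('C:\\Users\\qiuyu\\frame3.csv')

•Read a .csv file:
frame4 = pd.read_csv('C:\\Users\\qiuyu\\frame3.csv')               # the saved index comes back as an ordinary column
frame4 = pd.read_csv('C:\\Users\\qiuyu\\frame3.csv', index_col=0)  # use the first column as the index again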
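The effect of index_col=0 is easiest to see in a round trip. The sketch below writes to a temporary directory instead of the Windows path used above; that location is an assumption made so the example runs anywhere.

```python
import os
import tempfile

import pandas as pd

frame3 = pd.DataFrame({"Nevada": [2.4, 2.9, None], "Ohio": [1.7, 3.6, None]},
                      index=[2001, 2002, 2003])

csv_path = os.path.join(tempfile.mkdtemp(), "frame3.csv")   # hypothetical location
frame3.to_csv(csv_path)

plain = pd.read_csv(csv_path)                  # index is read back as a regular column
restored = pd.read_csv(csv_path, index_col=0)  # first column becomes the index again

print(plain.columns.tolist())
print(restored)
```

Without index_col the frame gains an extra unnamed column holding the old index, and a fresh integer index starting at 0 is created.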

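Extra: Panel:

A three-dimensional array that can be understood as a container of DataFrames. Panel was deprecated and has been removed from recent pandas releases (0.25 and later)
The term panel data comes from econometrics and is also the origin of the name pan(el)-da(ta)-s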
https://panel.holoviz.org/reference/widgets/DataFrame.html

4th. Scikit-learn:

https://scikit-learn.org/stable/

Resources recommended by the professor

•Python documentation: https://docs.python.org/3/
•Playing with Data in Python (用Python玩转数据): https://www.coursera.org/learn/hipython/
•Codecademy: https://www.codecademy.com/learn/learn-python
•Dataquest: https://www.dataquest.io/

First exercises I worked on

Pandas usage examples:

Python basics exercises

'''
1. You are given a time t as a dictionary with six string keys (year, month, day, hour, minute, second); every value is a string of digits,
e.g. t = {'year': '2013', 'month': '9', 'day': '30', 'hour': '16', 'minute': '45', 'second': '2'}.
Print it in the format XXXX-XX-XX XX:XX:XX. The example above should print: 2013-09-30 16:45:02.
Submit your code as a .py or .ipynb file.
'''
t = {'year': '2013', 'month': '9', 'day': '30', 'hour': '16', 'minute': '45', 'second': '2'}

# The values are strings, so pad them with zeros on the left instead of using the integer format code d
date_part = '{:0>4}-{:0>2}-{:0>2}'.format(t['year'], t['month'], t['day'])
time_part = '{:0>2}:{:0>2}:{:0>2}'.format(t['hour'], t['minute'], t['second'])
print(f'{date_part} {time_part}')



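'''
2. You are given a list L of integers. Print according to the following rules:
"UP" if L is sorted in ascending order;
"DOWN" if L is sorted in descending order;
"WRONG" if L is not sorted.
Submit your code as a .py or .ipynb file.
'''
A = input('Please enter a list L of integers separated by spaces or commas.\nI will print "UP" if L is ascending, "DOWN" if L is descending, and "WRONG" if L is unsorted.\n')
# list(A) would split the text into single characters, so the numbers are parsed explicitly
B = [int(x) for x in A.strip('[] \n').replace(',', ' ').split()] #https://dojang.io/mod/page/view.php?id=2286 #https://ghdwn0217.tistory.com/58
L = sorted(B, reverse = False) #https://vision-ai.tistory.com/24
R = sorted(B, reverse = True)#https://itholic.github.io/python-reverse-reversed/
if B == L:
    print('UP')
elif B == R:
    print('DOWN')
else:
    print('WRONG')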



'''
3. Every computer on the internet has an IP address. A valid IP has the form A.B.C.D, where A, B, C and D are all integers in [0, 255].
For simplicity, none of the four integers may have leading zeros, as in 001.
Given a string s (containing no whitespace), decide whether s is a valid IP: print Yes if it is, otherwise No.
For example, s="202.115.32.24" prints Yes; s="a.11.11.11" prints No.
Submit your code as a .py or .ipynb file.
'''
# All valid parts as strings: '0' ... '255'. Strings with leading zeros such as '001' are not in the set.
L = {str(i) for i in range(0, 256)}

def diagnose(s):
    """Return True if s has exactly four dot-separated parts and each one is in L."""
    parts = s.split('.')
    return len(parts) == 4 and all(part in L for part in parts)

s = input('Please enter an IP address: ')
print('Yes' if diagnose(s) else 'No')


Copyright： Copyright is owned by the author. For commercial reprints, please contact the author for authorization. For non-commercial reprints, please indicate the source.
