Integrating Stata and Python

Hua Peng@StataCorp

2020 Stata China Users Conference

https://huapeng01016.github.io/china1-2020/

Tight integration of Stata 16 and Python

  • Run Python code interactively
  • Define and run Python code in do-files and ado-files
  • Python and Stata interact through the Stata Function Interface (sfi)

Running Python interactively

Hello World!

. python:
----------------------------------------------- python (type end to exit) -----------------------------------
>>> print('Hello World!')
Hello World!
>>> end
-------------------------------------------------------------------------------------------------------------

for loop

As in any other Python environment, Python statements entered in Stata must be indented correctly.

. python:
----------------------------------------------- python (type end to exit) -----------------------------------
>>> sum = 0
>>> for i in range(7):
...     sum = sum + i
>>> print(sum)
21
>>> end
-------------------------------------------------------------------------------------------------------------
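A slightly larger sketch of the same indentation rule, runnable in any Python environment: each nested block needs one further level of indentation. The variable name `total` is illustrative and does not come from the talk.

```python
# Sum the even numbers below 7; each nested block is indented one more level.
total = 0
for i in range(7):
    if i % 2 == 0:      # body of the for loop
        total += i      # body of the if statement
print(total)
```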

sfi

. python:
----------------------------------------------- python (type end to exit) -----------------------------------
>>> from functools import reduce
>>> from sfi import Data, Macro
>>> 
>>> stata: quietly sysuse auto, clear
>>> 
>>> sum = reduce((lambda x, y: x + y), Data.get(var='price'))
>>> 
>>> Macro.setLocal('sum', str(sum))
>>> end
-------------------------------------------------------------------------------------------------------------

. display "sum of var price is : `sum'"
sum of var price is : 456229
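The `reduce` call above is one way to add up a list; Python's built-in `sum()` should give the same result. The sketch below uses a short stand-in list in place of `Data.get(var='price')`, so it runs outside Stata; the values and names are illustrative only.

```python
from functools import reduce

# Stand-in for Data.get(var='price'); illustrative values only.
prices = [4099, 4749, 3799]

total_reduce = reduce(lambda x, y: x + y, prices)  # approach used on the slide
total_builtin = sum(prices)                        # built-in equivalent

print(total_reduce, total_builtin)
```

Note that assigning the result to a variable named `sum`, as the slide does, shadows the built-in `sum()` for the rest of the Python session; a name such as `total` avoids that.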

More sfi

. python:
----------------------------------------------- python (type end to exit) -----------------------------------
>>> sum1 = reduce((lambda x, y: x + y), Data.get(var='rep78'))
>>> sum1
inf
>>> sum2 = reduce((lambda x, y: x + y), Data.get(var='rep78', selectvar=-1))
>>> sum2
235
>>> end
-------------------------------------------------------------------------------------------------------------
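Why the first sum is `inf`: a plausible explanation is that Stata's missing values reach Python as very large doubles, so adding two or more of them overflows. The sketch below assumes the system missing value `.` equals 2**1023 and uses a made-up stand-in list instead of `Data.get(var='rep78')`, so it can run outside Stata.

```python
import math
from functools import reduce

# Assumption: Stata's system missing value "." arrives in Python as 2**1023.
STATA_MISSING = 2.0 ** 1023

# Stand-in for Data.get(var='rep78'): made-up values with two missing entries.
values = [3.0, STATA_MISSING, 4.0, STATA_MISSING, 5.0]

# Adding two missing codes exceeds the largest double, so the sum becomes inf.
naive = reduce(lambda x, y: x + y, values)

# Dropping every value at or above the missing code gives a finite sum.
clean = reduce(lambda x, y: x + y, [v for v in values if v < STATA_MISSING])

print(math.isinf(naive), clean)
```

Inside Stata, the `selectvar=-1` argument shown above achieves the same effect by skipping observations with missing values.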

Using Python modules

  • Pandas
  • Numpy
  • Matplotlib, Plotly
  • BeautifulSoup, lxml
  • Scikit-Learn, Tensorflow, Keras
  • NLTK, jieba

Fetching and displaying web data

Fetching COVID-19 data

local date = "07-30-2020"
python:
import numpy as np  # needed for np.int32 below
import pandas as pd
df = pd.read_csv("https://raw.githubusercontent.com/"\
    "CSSEGISandData/COVID-19/master/csse_covid_19_data/"\
    "csse_covid_19_daily_reports/`date'.csv",\
    dtype={"fips" : np.int32})
df.columns = df.columns.str.lower()
df = df.loc[df['country_region'] == "US"]
df.head()
end
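In the block above, Stata expands the local macro `date' before Python sees the code. When the same download is run outside Stata, the file name has to be built in Python instead. A minimal sketch, assuming the MM-DD-YYYY naming shown above; `daily_report_url` is a hypothetical helper, not part of the talk.

```python
from datetime import date

# Base URL as it appears in the do-file above.
BASE = ("https://raw.githubusercontent.com/"
        "CSSEGISandData/COVID-19/master/csse_covid_19_data/"
        "csse_covid_19_daily_reports/")

def daily_report_url(day):
    """Return the daily-report CSV URL for a datetime.date (MM-DD-YYYY file name)."""
    return BASE + day.strftime("%m-%d-%Y") + ".csv"

url = daily_report_url(date(2020, 7, 30))
print(url)
```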

Displaying the data with plotly

python:
from urllib.request import urlopen
import numpy as np
import json
with urlopen("https://raw.githubusercontent.com/"\
    "plotly/datasets/master/geojson-counties-fips.json") as response:
    counties = json.load(response)

import pandas as pd
df = pd.read_csv("https://raw.githubusercontent.com/"\
    "CSSEGISandData/COVID-19/master/csse_covid_19_data/"\
    "csse_covid_19_daily_reports/`date'.csv",\
    dtype={"fips" : np.int32})
df.columns = df.columns.str.lower()
df = df.loc[df['country_region'] == "US"]
import plotly.express as px
fig = px.choropleth(df, geojson=counties, locations='fips', 
                        color='confirmed',
                        hover_data=['combined_key', 'confirmed'],
                        color_continuous_scale='Inferno',
                        range_color = [100, 5000],
                        scope="usa",
                        labels={'confirmed':'confirmed cases'}
                    )
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()
# fig.write_html("./stata/`date'-`state'.html")
end
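If counties fail to match on the map, one thing worth checking is the type of the `fips` column. The sketch below assumes the county GeoJSON identifies each feature by a five-character, zero-padded string such as "01001", while the CSV may deliver FIPS codes as numbers; `fips_to_id` is a hypothetical helper, not part of the talk.

```python
import math

def fips_to_id(value):
    """Convert a numeric FIPS code (int, float, or NaN) to a 5-character string, or None."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return None                      # no FIPS code available for this row
    return str(int(value)).zfill(5)      # 1001.0 -> "01001"

print(fips_to_id(1001.0), fips_to_id(45001), fips_to_id(float("nan")))
```

With pandas, a helper like this could be applied to the column before passing the data frame to `px.choropleth`.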

3D surface plot

Importing Python modules

. python:
----------------------------------------------- python (type end to exit) -----------------------------------
>>> import numpy as np
>>> from sfi import Platform
>>> 
>>> import matplotlib
>>> if Platform.isWindows():
...         matplotlib.use('TkAgg')
... 
>>> import matplotlib.pyplot as plt
>>> from mpl_toolkits import mplot3d
>>> from sfi import Data
>>> end
-------------------------------------------------------------------------------------------------------------

Importing data with sfi.Data

. use https://www.stata-press.com/data/r16/sandstone, clear
(Subsea elevation of Lamont sandstone in an area of Ohio)

. * Use sfi to get data from Stata
. python:
----------------------------------------------- python (type end to exit) -----------------------------------
>>> D = np.array(Data.get("northing easting depth"))
>>> end
-------------------------------------------------------------------------------------------------------------
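To illustrate the shape of the data, the sketch below assumes that `Data.get` with several variables returns one inner list per observation, which is why wrapping it in `np.array` gives a matrix whose columns are the variables. It uses a made-up stand-in list and plain Python so that it runs outside Stata.

```python
# Stand-in for Data.get("northing easting depth"): one inner list per observation.
# The numbers are made up for illustration.
rows = [
    [60000.0, 30000.0, 7500.0],
    [70000.0, 35000.0, 7800.0],
    [80000.0, 40000.0, 8100.0],
]

# Transpose rows into columns, mirroring D[:,0], D[:,1], D[:,2] on a NumPy array.
northing, easting, depth = (list(col) for col in zip(*rows))

print(northing)
```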

Plotting a triangulated surface

python:
ax = plt.axes(projection='3d')
# Axis objects have no xticks()/yticks() methods; set the ticks on the Axes instead
ax.set_xticks(np.arange(60000, 90001, step=10000))
ax.set_yticks(np.arange(30000, 50001, step=5000))
ax.plot_trisurf(D[:,0], D[:,1], D[:,2], cmap='viridis', edgecolor='none')
plt.savefig("sandstone.png")
end

sandstone.png

Changing the colors and the viewing angle

python:
ax.plot_trisurf(D[:,0], D[:,1], D[:,2],
    cmap=plt.cm.Spectral, edgecolor='none')
ax.view_init(30, 60)
plt.savefig("sandstone1.png")
end

sandstone.png

Animation (do-file)