I want to convert HTML tables to CSV using pandas functions.
This is part of what I read into the dataframe df:
0 1
0 sequence 2
1 trainNo K805
2 trainNumber K805
3 departStation 鹰潭
4 departStationPy yingtan
5 arriveStation 南昌
6 arriveStationPy nanchang
7 departDate 2020-05-24
8 departTime 03:55
9 arriveDate 2020-05-24
10 arriveTime 05:44
11 isStartStation False
12 isEndStation False
13 runTime 1小时49分钟
14 preSaleTime NaN
15 takeDays 0
16 isBookable True
17 seatList seatNamepriceorderPriceinventoryisBookablebutt...
18 curSeatIndex 0
seatName price orderPrice inventory isBookable buttonDisplayName buttonType
0 硬座 23.5 23.5 99 True NaN 0
1 硬卧 69.5 69.5 99 True NaN 0
2 软卧 104.5 104.5 4 True NaN 0
0 1
0 departDate 2020-05-23
1 departStationList NaN
2 endStationList NaN
3 departStationFilterMap NaN
4 endStationFilterMap NaN
5 departCityName 上海
6 arriveCityName 南昌
7 gtMinPrice NaN
My code is like this:
for i, df in enumerate(pd.read_html(html, encoding='utf-8')):
    df.to_csv(".\other.csv", index=True, encoding='utf-8-sig')
To preserve the Chinese characters in the CSV, I need to use utf-8-sig encoding. But I don't know how to write a format placeholder (like %) into the file name so that each table is saved separately.
,0,1
0,departDate,2020-05-23
1,departStationList,
2,endStationList,
3,departStationFilterMap,
4,endStationFilterMap,
5,departCityName,上海
6,arriveCityName,南昌
7,gtMinPrice,
This is what I got in the CSV file; only the last table is preserved.
The dataframes are correct, but the CSV needs correction. Can you show me how to produce the correct output?
You're saving each dataframe to the same file, so each one is overwritten until only the last remains.
Note the addition of the f-string to change the file name for each table, e.g. f".\other_{i}.csv".
Each dataframe is a different shape, so they won't all fit together properly in one file.
To CSV
for i, df in enumerate(pd.read_html(html, encoding='utf-8')):
    df.to_csv(f".\other_{i}.csv", index=True, encoding='utf-8-sig')
To Excel
with pd.ExcelWriter('output.xlsx', mode='w') as writer:
    for i, df in enumerate(pd.read_html(html, encoding='utf-8')):
        df.to_excel(writer, sheet_name=f'Sheet{i}', encoding='utf-8-sig')
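If you want to confirm the characters survived, a quick check (a sketch; the file name assumes the loop above has run) is to read one file back:

import pandas as pd

# utf-8-sig writes a BOM, so Excel and pandas both detect the encoding and
# the Chinese station names survive the round trip
check = pd.read_csv(r".\other_0.csv", index_col=0, encoding='utf-8-sig')
print(check.head())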
Related
I have several CSV files with voltage-over-time data; each file is approximately 7000 rows. The data looks like this:
Time(us) Voltage (V)
0 32.96554106
0.5 32.9149649
1 32.90484966
1.5 32.86438874
2 32.8542735
2.5 32.76323642
3 32.74300595
3.5 32.65196886
4 32.58116224
4.5 32.51035562
5 32.42943376
5.5 32.38897283
6 32.31816621
6.5 32.28782051
7 32.26759005
7.5 32.21701389
8 32.19678342
8.5 32.16643773
9 32.14620726
9.5 32.08551587
10 32.04505495
10.5 31.97424832
11 31.92367216
11.5 31.86298077
12 31.80228938
12.5 31.78205891
13 31.73148275
13.5 31.69102183
14 31.68090659
14.5 31.67079136
15 31.64044567
15.5 31.59998474
16 31.53929335
16.5 31.51906288
I read the CSV file into a pandas dataframe, and plotting the data from one file in matplotlib gives a train of square waveforms (figure not reproduced here).
I would like to split every single square waveform/bit and store the corresponding voltage values for each bit separately, so that the voltage values of each bit end up stored in one row.
I don't have any idea how to do that. I guess I have to write a function with a threshold value: if the voltage values are going down for maybe 20 time steps, capture all the values, and if the voltage level is going up for 20 time steps, capture all the voltage values. Could someone help?
If you take the gradient of your Voltage (here using diff, since the time is regularly spaced), you get a sharp positive spike at the start of each pulse.
You can thus easily use a threshold (I tested with 2) to identify the pulse starts, then pivot your data:
# get threshold of gradient
m = df['Voltage (V)'].diff().gt(2)
# group start = value above threshold preceded by value below threshold
group = (m & ~m.shift(fill_value=False)).cumsum().add(1)
df2 = (df
       .assign(id=group,
               # time offset from the start of each pulse
               t=df.groupby(group)['Time (us)'].transform(lambda s: s - s.iloc[0]))
       .pivot(index='id', columns='t', values='Voltage (V)'))
output:
t 0.0 0.5 1.0 1.5 2.0 2.5 \
id
1 32.965541 32.914965 32.904850 32.864389 32.854273 32.763236
2 25.045314 27.543777 29.182444 30.588462 31.114454 31.984364
3 25.166697 27.746081 29.415095 30.719960 31.326873 32.125977
4 25.277965 27.877579 29.536477 30.912149 31.367334 32.206899
5 25.379117 27.978732 29.667975 30.780651 31.670791 32.338397
6 25.631998 27.634814 28.959909 30.173737 30.659268 31.053762
7 23.528030 26.137759 27.948386 29.253251 30.244544 30.649153
8 23.639297 26.380525 28.464263 29.971432 30.902034 31.458371
9 23.740449 26.542369 28.707028 30.295120 30.881803 31.862981
10 23.871948 26.673867 28.889103 30.305235 31.185260 31.873096
11 24.387824 26.694097 28.342880 29.678091 30.315350 31.134684
...
t 748.5 749.0
id
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
5 NaN NaN
6 21.059913 21.161065
7 NaN NaN
8 NaN NaN
9 NaN NaN
10 NaN NaN
11 NaN NaN
[11 rows x 1499 columns]
plot:
df2.T.plot()
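To make the approach reproducible without the original CSV, here is a self-contained sketch on hypothetical synthetic square-wave data; the column names 'Time (us)' and 'Voltage (V)' match the code above, everything else is made up for illustration:

import numpy as np
import pandas as pd

# hypothetical square wave: 5 us high, 5 us low, sampled every 0.5 us
t = np.arange(0, 30, 0.5)
v = np.where((t % 10) < 5, 30.0, 20.0)
df = pd.DataFrame({'Time (us)': t, 'Voltage (V)': v})

m = df['Voltage (V)'].diff().gt(2)                        # rising edges
group = (m & ~m.shift(fill_value=False)).cumsum().add(1)  # pulse id per row
df2 = (df
       .assign(id=group,
               t=df.groupby(group)['Time (us)'].transform(lambda s: s - s.iloc[0]))
       .pivot(index='id', columns='t', values='Voltage (V)'))
print(df2)  # one pulse per row, columns are time offsets within the pulse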
I have data as follows:
import pandas as pd

url_cities = "https://population.un.org/wup/Download/Files/WUP2018-F12-Cities_Over_300K.xls"
df_cities = pd.read_excel(url_cities)
# the first row whose second column is not NaN holds the real header
i = df_cities.iloc[:, 1].notna().idxmax()
df_cities.columns = df_cities.iloc[i].tolist()
df_cities = df_cities.iloc[i+1:]
df_cities = df_cities.rename(columns={2020.0: 'City_pop'})
print(df_cities.iloc[0:20])
I want to remove all columns for which the column names (NOT COLUMN VALUES) are floats.
I have looked at a couple of links (A, B, C), but I could not find the answer. Any suggestions?
This will do what your question asks:
df = df[[col for col in df.columns if not isinstance(col, float)]]
Example:
import pandas as pd
df = pd.DataFrame(columns=['a',1.1,'b',2.2,3,True,4.4,'c'],data=[[1,2,3,4,5,6,7,8],[11,12,13,14,15,16,17,18]])
print(df)
df = df[[col for col in df.columns if not isinstance(col, float)]]
print(df)
Initial dataframe:
a 1.1 b 2.2 3 True 4.4 c
0 1 2 3 4 5 6 7 8
1 11 12 13 14 15 16 17 18
Result:
a b 3 True c
0 1 3 5 6 8
1 11 13 15 16 18
Note that 3 is an int, not a float, so its column has not been removed.
Please try this code:
my_list = list(df_cities.columns)
for i in my_list:
    if not isinstance(i, str):
        df_cities = df_cities.drop(columns=[i])
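An equivalent vectorized sketch (same effect as the loop above, keeping only the columns whose name is a string):

df_cities = df_cities.loc[:, [isinstance(c, str) for c in df_cities.columns]]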
I think your basic problem is the call that reads the Excel file.
If you skip the early rows and define the index correctly, you avoid the issue of having to remove float column headers altogether.
So change your call that opens the Excel file to the following:
df_cities = pd.read_excel(url_cities, skiprows=16, index_col=0)
Which yields a df like the following:
Country Code Country or area City Code Urban Agglomeration Note Latitude Longitude 1950 1955 1960 ... 1990 1995 2000 2005 2010 2015 2020 2025 2030 2035
Index
1 4 Afghanistan 20001 Herat NaN 34.348170 62.199670 82.468 85.751 89.166 ... 183.465 207.190 233.991 275.678 358.691 466.703 605.575 752.910 897.041 1057.573
2 4 Afghanistan 20002 Kabul NaN 34.528887 69.172460 170.784 220.749 285.352 ... 1549.320 1928.694 2401.109 2905.178 3289.005 3723.543 4221.532 4877.024 5737.138 6760.500
3 4 Afghanistan 20003 Kandahar NaN 31.613320 65.710130 82.199 89.785 98.074 ... 233.243 263.395 297.456 336.746 383.498 436.741 498.002 577.128 679.278 800.461
4 4 Afghanistan 20004 Mazar-e Sharif NaN 36.709040 67.110870 30.000 37.139 45.979 ... 135.153 152.629 172.372 206.403 283.532 389.483 532.689 681.531 816.040 962.262
I am trying to calculate the mean of all previous rows for each column of the DataFrame and add the calculated mean columns to the DataFrame.
I am using a set of NBA games data that contains 20+ features (columns) that I am trying to calculate the means for. An example of the dataset is below. (Note: "...." represents the rest of the feature columns.)
Team TeamPoints OpponentPoints.... TeamPoints_mean OpponentPoints_mean
ATL 102 109 .... nan nan
ATL 102 92 .... 102 109
ATL 92 94 .... 102 100.5
BOS 119 122 .... 98.67 98.33
BOS 103 96 .... 103.75 104.25
Example for calculating two of the columns:
dataset = pd.read_csv('nba.games.stats.csv')
df = dataset
df['TeamPoints_mean'] = df.groupby('Team')['TeamPoints'].apply(lambda x: x.shift().expanding().mean())
df['OpponentPoints_mean'] = df.groupby('Team')['OpponentPoints'].apply(lambda x: x.shift().expanding().mean())
Again, this only calculates the mean and adds the column to the DataFrame one column at a time. Is there a way to get the column means and add them to the DataFrame without doing it one at a time, perhaps with a for loop? An example of what I am looking for is below ("..." = the mean columns for the rest of the feature columns).
Team TeamPoints OpponentPoints .... TeamPoints_mean OpponentPoints_mean ....
ATL 102 109 .... nan nan
ATL 102 92 .... 102 109
ATL 92 94 .... 102 100.5
BOS 119 122 .... 98.67 98.33
BOS 103 96 .... 103.75 104.25
Try this one:
(0) sample input:
>>> df
col1 col2 col3
0 1.490977 1.784433 0.852842
1 3.726663 2.845369 7.766797
2 0.042541 1.196383 6.568839
3 4.784911 0.444671 8.019933
4 3.831556 0.902672 0.198920
5 3.672763 2.236639 1.528215
6 0.792616 2.604049 0.373296
7 2.281992 2.563639 1.500008
8 4.096861 0.598854 4.934116
9 3.632607 1.502801 0.241920
Then processing:
(1) side table to get all the means on the side (I didn't find a cumulative mean function, so went with cumsum + count):
>>> df_side=df.assign(col_temp=1).cumsum()
>>> df_side
col1 col2 col3 col_temp
0 1.490977 1.784433 0.852842 1.0
1 5.217640 4.629801 8.619638 2.0
2 5.260182 5.826184 15.188477 3.0
3 10.045093 6.270855 23.208410 4.0
4 13.876649 7.173527 23.407330 5.0
5 17.549412 9.410166 24.935545 6.0
6 18.342028 12.014215 25.308841 7.0
7 20.624021 14.577855 26.808849 8.0
8 24.720882 15.176708 31.742965 9.0
9 28.353489 16.679509 31.984885 10.0
>>> for el in df.columns:
...     df_side["{}_mean".format(el)] = df_side[el] / df_side.col_temp
...
>>> df_side = df_side.drop([el for el in df.columns] + ["col_temp"], axis=1)
>>> df_side
col1_mean col2_mean col3_mean
0 1.490977 1.784433 0.852842
1 2.608820 2.314901 4.309819
2 1.753394 1.942061 5.062826
3 2.511273 1.567714 5.802103
4 2.775330 1.434705 4.681466
5 2.924902 1.568361 4.155924
6 2.620290 1.716316 3.615549
7 2.578003 1.822232 3.351106
8 2.746765 1.686301 3.526996
9 2.835349 1.667951 3.198489
(2) joining back, on index:
>>> df_final=df.join(df_side)
>>> df_final
col1 col2 col3 col1_mean col2_mean col3_mean
0 1.490977 1.784433 0.852842 1.490977 1.784433 0.852842
1 3.726663 2.845369 7.766797 2.608820 2.314901 4.309819
2 0.042541 1.196383 6.568839 1.753394 1.942061 5.062826
3 4.784911 0.444671 8.019933 2.511273 1.567714 5.802103
4 3.831556 0.902672 0.198920 2.775330 1.434705 4.681466
5 3.672763 2.236639 1.528215 2.924902 1.568361 4.155924
6 0.792616 2.604049 0.373296 2.620290 1.716316 3.615549
7 2.281992 2.563639 1.500008 2.578003 1.822232 3.351106
8 4.096861 0.598854 4.934116 2.746765 1.686301 3.526996
9 3.632607 1.502801 0.241920 2.835349 1.667951 3.198489
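For reference, pandas does provide a cumulative mean through expanding(), so the side table above can be built directly; a minimal sketch:

# equivalent to the cumsum/col_temp construction above
df_side = df.expanding().mean().add_suffix('_mean')
df_final = df.join(df_side)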
I am trying to calculate the means of all previous rows for each column of the DataFrame
To get all of the columns at once, you can do:
df_means = df.join(df.cumsum() /
                   df.applymap(lambda x: 1).cumsum(),
                   rsuffix="_mean")
However, if Team is a column rather than the index, you'd want to get rid of it first:
df_data = df.drop('Team', axis=1)
df_means = df.join(df_data.cumsum() /
                   df_data.applymap(lambda x: 1).cumsum(),
                   rsuffix="_mean")
You could also select just the numeric columns:
import numpy as np
df_data = df[[col for col in df.columns
              if np.issubdtype(df[col].dtype, np.number)]]
Or manually define a list of columns that you want to take the mean of, cols_for_mean, and then do
df_data = df[cols_for_mean]
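Putting this together for the NBA case, here is a sketch that computes the shifted expanding mean of every numeric column per team in one pass; the column names 'Team', 'TeamPoints', and 'OpponentPoints' are taken from the question, and the toy data is only for illustration:

import pandas as pd

df = pd.DataFrame({'Team': ['ATL', 'ATL', 'ATL', 'BOS', 'BOS'],
                   'TeamPoints': [102, 102, 92, 119, 103],
                   'OpponentPoints': [109, 92, 94, 122, 96]})

# shift() so each row only sees previous games, then expanding mean per team
num_cols = df.select_dtypes('number').columns
means = (df.groupby('Team')[num_cols]
           .transform(lambda s: s.shift().expanding().mean()))
df = df.join(means.add_suffix('_mean'))
print(df)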
The Overview:
In our project, we are working with a CSV file that contains some data; we will call it smal.csv. It is a bit of a chunky file that will later be used by some other algorithms. (Here is the gist in case the link to smal.csv is too badly formatted for your browser.)
The file will be loaded like this
filename = "smal.csv"
keyname = "someKeyname"
self.data[keyname] = spectral_data(pd.read_csv(filename, header=[0, 1], verbose=True))
The spectral_data class looks like this. As you can see, we do not actually keep the dataframe as-is.
class spectral_data(object):
    def __init__(self, df):
        try:
            uppercols = df.columns.levels[0]
            lowercols = list(df.columns.levels[1].values)
        except:
            df.columns = pd.MultiIndex.from_tuples(list(df.columns))
            uppercols = df.columns.levels[0]
            lowercols = list(df.columns.levels[1].values)
        for i, val in enumerate(lowercols):
            try:
                lowercols[i] = float(val)
            except:
                lowercols[i] = val
        levels = [uppercols, lowercols]
        df.columns.set_levels(levels, inplace=True)
        self.df = df
After we've loaded it we'd like to concatenate it with another set of data, also loaded like smal.csv was.
Our concatenation is done like this.
new_df = pd.concat([self.data[dataSet1].df, self.data[dataSet2].df], ignore_index=True)
However, ignore_index=True does not do what we need, because the running number we want reset is stored in a regular column, not in the index. We cannot simply remove that column either; it is necessary for other parts of our program.
The Objective:
I'm trying to concatenate a couple of dataframes; however, what I thought was the index is not actually the index of the dataframe. Thus the command
pd.concat([df1.df, df2.df], ignore_index=True)
will not renumber it. I thought maybe using iloc to change each individual cell would work, but that does not feel like the most intuitive approach.
How can I get a data frame that looks like this
[396 rows x 6207 columns]
Unnamed: 0_level_0 meta ... wvl
Unnamed: 0_level_1 Sample ... 932.695 932.89
0 1 NaN ... -12.33 9.67
1 2 NaN ... 11.94 3.94
2 3 NaN ... -2.67 28.33
3 4 NaN ... 53.22 -13.78
4 1 NaN ... 43.28
5 2 NaN ... 41.33 47.33
6 3 NaN ... -21.94 12.06
7 4 NaN ... -30.94 -1.94
8 5 NaN ... -24.78 40.22
Turn into this.
[396 rows x 6207 columns]
Unnamed: 0_level_0 meta ... wvl
Unnamed: 0_level_1 Sample ... 932.695 932.89
0 1 NaN ... -12.33 9.67
1 2 NaN ... 11.94 3.94
2 3 NaN ... -2.67 28.33
3 4 NaN ... 53.22 -13.78
4 5 NaN ... 43.28
5 6 NaN ... 41.33 47.33
6 7 NaN ... -21.94 12.06
7 8 NaN ... -30.94 -1.94
8 9 NaN ... -24.78 40.22
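One possible way to get there (a sketch, not a confirmed fix, since the exact column label depends on how the header was parsed): concatenate as before, then rewrite the running-number column, which is the first column in the frames shown above, so that it counts 1..n across the combined frame.

import pandas as pd

# ignore_index=True resets the real index; the visible numbering lives in
# the first (MultiIndex-labelled) column, so renumber it by position
new_df = pd.concat([df1.df, df2.df], ignore_index=True)
new_df.iloc[:, 0] = range(1, len(new_df) + 1)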
I have a .dat file whose origin I am not sure about. I have to read this file in order to perform PCA. Assuming it to be a whitespace-separated file, I was able to read the contents and ignore the first column (as it is an index), but not the very first row, which gets lost (it is treated as a header). Below is the code:
import numpy as np
import pandas as pd
from numpy import array

myarray = pd.read_csv('hand_postures.dat', delim_whitespace=True)
myarray = array(myarray)            # convert the dataframe to a numpy array
print(myarray.shape)
myarray = np.delete(myarray, 0, 1)  # drop the first column (the index)
print(myarray)
print(myarray.shape)
The file is shared at the link https://drive.google.com/open?id=0ByLV3kGjFP_zekN1U1c3OGFrUnM. Can someone help me point out my mistake?
You need a couple of extra parameters when calling pd.read_csv:
df = pd.read_csv('hand_postures.dat', header=None, delim_whitespace=True, index_col=[0])
df.head()
1 2 3 4 5 6 7 8 \
0
0 -65.55560 0.172413 44.4944 22.2472 0.000000 50.6723 34.3434 17.1717
1 -65.55560 2.586210 43.8202 21.9101 0.277778 51.4286 34.3434 17.1717
2 -45.55560 5.000000 43.8202 21.9101 0.833333 56.7227 42.4242 21.2121
3 5.55556 -2.241380 46.5169 23.2584 1.111110 70.3361 85.8586 42.9293
4 67.77780 20.689700 59.3258 29.6629 2.222220 80.9244 93.9394 46.9697
9 10 11 12 13 14 15 16 \
0
0 -0.235294 54.6154 39.7849 19.8925 0.705883 37.2656 41.3043 20.6522
1 -0.235294 55.3846 38.7097 19.3548 0.705883 38.6719 41.3043 20.6522
2 0.000000 63.0769 47.3118 23.6559 0.000000 47.8125 54.3478 27.1739
3 -0.117647 83.8462 90.3226 45.1613 0.352941 73.1250 92.3913 46.1957
4 0.117647 93.8462 98.9247 49.4624 -0.352941 89.2969 100.0000 50.0000
17 18 19 20
0
0 15.0 34.6584 54.1270 27.0635
1 14.4 35.2174 55.8730 27.9365
2 14.4 43.6025 69.8413 34.9206
3 3.6 73.7888 94.2857 47.1429
4 -1.2 92.2360 106.5080 53.2540
header=None specifies that the first row is part of the data (not the header).
index_col=[0] specifies that the first column is to be treated as the index.
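A short usage sketch building on this: once the header and index are handled in read_csv, the numpy deletion step from the question is unnecessary, and the values can go straight into PCA.

import pandas as pd

df = pd.read_csv('hand_postures.dat', header=None, delim_whitespace=True, index_col=[0])
X = df.to_numpy()  # expected shape: (n_samples, 20), per the output above
print(X.shape)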