I am trying to expand a column of a pandas dataframe
(see the segments column in the example below).
I am able to break it out into the components separated by ;
However, as you can see, some of the rows do not have all the
elements. So what is happening is that data which should go into
the Geo column ends up in the BusSeg column when there is no
BusSeg entry, or data that should be in the ProdServ column ends
up in the Geo column.
Ideally I would like each cell to hold only the data, correctly
placed, without the indicator. So in the Geo column it should say
'NonUs', not 'Geo=NonUs'. That is, after separating correctly, I
would like to remove the text up to and including the '=' sign in
each cell. How can I do this?
Code below:
import pandas as pd
company1 = ('Rev','Rev','Rev','Rev','Rev','Rev','Rev','Rev','Rev')
df1 = pd.DataFrame(columns=None)
df1['company'] = company1
df1['clv']=[500,200,3000,400,10,300,560,500,600]
df1['date'] = [20191231,20191231,20191231,20181231,20181231,20181231,20171231,20171231,20171231 ]
df1['line'] = [1,3,2,1,3,2,1,3,2]
df1['segments'] =['BusSeg=Pharma;Geo=NonUs;Prd=Alpha;Subseg=Tr1',
'BusSeg=Dev;Prd=Alpha;Subseg=Tr1',
'BusSeg=Pharma;Geo=US;Prd=Alpha;Subseg=Tr2',
'Subseg=Tr1',
'BusSeg=Pharma',
'Geo=China;Prd=Alpha;Subseg=Tr4;',
'Prd=Beta;Subseg=Tr1',
'BusSeg=Pharma;Geo=US;Prd=Delta;Subseg=Tr1;',
'BusSeg=Pharma;Geo=NonUs;']
print("\ndf1:")
df1[['BusSeg','Geo','ProdServ','Sub','Misc']] = df1['segments'].str.split(';',expand=True)
print(df1)
print(df1[['BusSeg','Geo','ProdServ','Sub','Misc']])
print(df1.dtypes)
print()
Your Data
import pandas as pd
company1 = ('Rev','Rev','Rev','Rev','Rev','Rev','Rev','Rev','Rev')
df1 = pd.DataFrame(columns=None)
df1['company'] = company1
df1['clv']=[500,200,3000,400,10,300,560,500,600]
df1['date'] = [20191231,20191231,20191231,20181231,20181231,20181231,20171231,20171231,20171231 ]
df1['line'] = [1,3,2,1,3,2,1,3,2]
df1['segments'] =['BusSeg=Pharma;Geo=NonUs;Prd=Alpha;Subseg=Tr1',
'BusSeg=Dev;Prd=Alpha;Subseg=Tr1',
'BusSeg=Pharma;Geo=US;Prd=Alpha;Subseg=Tr2',
'Subseg=Tr1',
'BusSeg=Pharma',
'Geo=China;Prd=Alpha;Subseg=Tr4;',
'Prd=Beta;Subseg=Tr1',
'BusSeg=Pharma;Geo=US;Prd=Delta;Subseg=Tr1;',
'BusSeg=Pharma;Geo=NonUs;']
df:
company clv date line segments
0 Rev 500 20191231 1 BusSeg=Pharma;Geo=NonUs;Prd=Alpha;Subseg=Tr1
1 Rev 200 20191231 3 BusSeg=Dev;Prd=Alpha;Subseg=Tr1
2 Rev 3000 20191231 2 BusSeg=Pharma;Geo=US;Prd=Alpha;Subseg=Tr2
3 Rev 400 20181231 1 Subseg=Tr1
4 Rev 10 20181231 3 BusSeg=Pharma
5 Rev 300 20181231 2 Geo=China;Prd=Alpha;Subseg=Tr4;
6 Rev 560 20171231 1 Prd=Beta;Subseg=Tr1
7 Rev 500 20171231 3 BusSeg=Pharma;Geo=US;Prd=Delta;Subseg=Tr1;
8 Rev 600 20171231 2 BusSeg=Pharma;Geo=NonUs;
Comment out this line df1[['BusSeg','Geo','ProdServ','Sub','Misc']] = df1['segments'].str.split(';',expand=True) in your code, and add these two lines:
d = pd.DataFrame(df1['segments'].str.split(';')
                 .apply(lambda x: {i.split('=')[0]: i.split('=')[1] for i in x if i})
                 .to_dict()).T
df = pd.concat([df1, d], axis=1)
df:
company clv date line segments BusSeg Geo Prd Subseg
0 Rev 500 20191231 1 BusSeg=Pharma;Geo=NonUs;Prd=Alpha;Subseg=Tr1 Pharma NonUs Alpha Tr1
1 Rev 200 20191231 3 BusSeg=Dev;Prd=Alpha;Subseg=Tr1 Dev NaN Alpha Tr1
2 Rev 3000 20191231 2 BusSeg=Pharma;Geo=US;Prd=Alpha;Subseg=Tr2 Pharma US Alpha Tr2
3 Rev 400 20181231 1 Subseg=Tr1 NaN NaN NaN Tr1
4 Rev 10 20181231 3 BusSeg=Pharma Pharma NaN NaN NaN
5 Rev 300 20181231 2 Geo=China;Prd=Alpha;Subseg=Tr4; NaN China Alpha Tr4
6 Rev 560 20171231 1 Prd=Beta;Subseg=Tr1 NaN NaN Beta Tr1
7 Rev 500 20171231 3 BusSeg=Pharma;Geo=US;Prd=Delta;Subseg=Tr1; Pharma US Delta Tr1
8 Rev 600 20171231 2 BusSeg=Pharma;Geo=NonUs; Pharma NonUs NaN NaN
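For reference, the same reshaping can be done in one pass with str.extractall, which pulls out every key=value pair and then pivots the matches back into one column per key. A sketch (the names pairs and wide are just for illustration):
# Each match becomes one row of a MultiIndexed frame; pivoting on the
# captured key turns the pairs back into columns.
pairs = df1['segments'].str.extractall(r'(?P<key>\w+)=(?P<val>\w+)')
wide = pairs.reset_index().pivot(index='level_0', columns='key', values='val')
df = pd.concat([df1, wide], axis=1)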
I suggest filling the columns one by one instead of using split, with something like the following code:
col = ['BusSeg', 'Geo', 'ProdServ', 'Sub']  # Column names.
var = ['BusSeg', 'Geo', 'Prd', 'Subseg']    # Variable names in the 'segments' column.
for c, v in zip(col, var):
    # Dropping the trailing ';' from the pattern lets it match values at the
    # end of the string as well; \w* already stops at the next ';'.
    df1[c] = df1['segments'].str.extract(rf'{v}=(\w*)')
Here's a suggestion:
df1.segments = (df1.segments.str.split(';')
                .apply(lambda s: dict(t.split('=') for t in s if t.strip() != '')))
df2 = pd.DataFrame({col: [dict_.get(col, '') for dict_ in df1.segments]
                    for col in set().union(*df1.segments)},
                   index=df1.index)
df1.drop(columns=['segments'], inplace=True)
df1 = pd.concat([df1, df2], axis='columns')
Result:
company clv date line Subseg Geo BusSeg Prd
0 Rev 500 20191231 1 Tr1 NonUs Pharma Alpha
1 Rev 200 20191231 3 Tr1 Dev Alpha
2 Rev 3000 20191231 2 Tr2 US Pharma Alpha
3 Rev 400 20181231 1 Tr1
4 Rev 10 20181231 3 Pharma
5 Rev 300 20181231 2 Tr4 China Alpha
6 Rev 560 20171231 1 Tr1 Beta
7 Rev 500 20171231 3 Tr1 US Pharma Delta
8 Rev 600 20171231 2 NonUs Pharma
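A small note on the df2 construction: once segments holds dicts (after the first step above), pandas can build the same frame directly from the list of dicts. A sketch:
# pd.DataFrame takes the union of the dict keys as columns and fills
# missing keys with NaN; fillna('') matches the output above.
df2 = pd.DataFrame(df1.segments.tolist(), index=df1.index).fillna('')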
I have 2 dataframes. rdf is the reference dataframe I am trying to use to define each interval (Top and Bottom) to calculate an average between (over all of the depths within that interval), while ldf actually contains the values the calculation runs on. rdf defines the top and bottom for each ID an average should be run for, and there are multiple intervals for each ID.
rdf is formatted as such:
ID Top Bottom
1 2010 3000
1 4300 4500
1 4550 5000
1 7100 7700
2 3200 4100
2 4120 4180
2 4300 5300
2 5500 5520
3 2300 2380
3 3200 4500
ldf is formatted as such:
ID Depth(ft) Value1 Value2 Value3
1 2000 45 .32 423
1 2000.5 43 .33 500
1 2001 40 .12 643
1 2001.5 28 .10 20
1 2002 40 .10 34
1 2002.5 23 .11 60
1 2003 34 .08 900
1 2003.5 54 .04 1002
2 2000 40 .28 560
2 2000 38 .25 654
...
3 2000 43 .30 343
I want to use rdf to define the top and bottom of each interval and calculate the average of Value1, Value2, and Value3 within it. I would also like a count documented as well (not all of the values between the intervals necessarily exist, so it could be less than Bottom - Top). This will then modify rdf to make a new file:
new_rdf is formatted as such:
ID Top Bottom avgValue1 avgValue2 avgValue3 ThicknessCount(ft)
1 2010 3000 54 .14 456 74
1 4300 4500 23 .18 632 124
1 4550 5000 34 .24 780 111
1 7100 7700 54 .19 932 322
2 3200 4100 52 .32 134 532
2 4120 4180 16 .11 111 32
2 4300 5300 63 .29 872 873
2 5500 5520 33 .27 1111 9
3 2300 2380 63 .13 1442 32
3 3200 4500 37 .14 1839 87
I've been going back and forth on the best way to do this. I tried mimicking this time series example: Sum set of values from pandas dataframe within certain time frame
but it doesn't seem translatable:
import pandas as pd
Top = rdf['Top']
Bottom = rdf['Bottom']
Depths = ldf['DEPTH']
def get_depths(x):
    n = ldf[(ldf['DEPTH'] > x['top']) & (ldf['DEPTH'] < x['bottom'])]
    return n['ID'].values[0], n['DEPTH'].sum()
test = pd.DataFrame({'top':Top, 'bottom':Bottom})
test[['ID','Value1']] = test.apply(lambda x : get_depths(x),1).apply(pd.Series)
I get "TypeError: Invalid comparison between dtype=float64 and str"
It works if I use the samples from that post, but it doesn't work with my data. I'm also hoping there's a simpler way to do this.
Edit # 2A:
Note:
The sample DataFrames below are not exactly the same as those posted in the question.
Posting new code here that uses Top and Bottom from rdf to check DEPTH in ldf and calculate .mean() for each group using a for-loop. A range_key is created in rdf that is unique to each row, assuming that the DataFrame rdf does not have any duplicates.
# Import libraries
import numpy as np  # needed for np.nan and np.linspace below
import pandas as pd
# Create DataFrame
rdf = pd.DataFrame({
    'ID': [1,1,1,1,2,2,2,2,3,3],
    'Top': [2000,4300,4500,7100,3200,4120,4300,5500,2300,3200],
    'Bottom':[2500,4500,5000,7700,4100,4180,5300,5520,2380,4500]
})
ldf = pd.DataFrame({
    'ID': [1,1,1,1,1,1,1,1,2,2,3],
    'DEPTH': [2000,2000.5,2001,2001.5,4002,4002.5,5003,5003.5,2000,2000,2000],
    'Value1':[45,43,40,28,40,23,34,54,40,38,43],
    'Value2':[.32,.33,.12,.10,.10,.11,.08,.04,.28,.25,.30],
    'Value3':[423,500,643,20,34,60,900,1002,560,654,343]
})
# Create a key for merge later
ldf['range_key'] = np.nan
rdf['range_key'] = np.linspace(1,rdf.shape[0],rdf.shape[0]).astype(int).astype(str)
# Flag each row for a range
for i in range(ldf.shape[0]):
    for j in range(rdf.shape[0]):
        d = ldf['DEPTH'][i]
        if (d >= rdf['Top'][j]) and (d <= rdf['Bottom'][j]):
            rkey = rdf['range_key'][j]
            ldf.loc[i, 'range_key'] = rkey  # .loc avoids chained assignment
            break
ldf['range_key'] = ldf['range_key'].astype(int).astype(str) # Convert to string
# Calculate mean for groups
ldf_mean = ldf.groupby(['ID','range_key']).mean().reset_index()
ldf_mean = ldf_mean.drop(['DEPTH'], axis=1)
# Merge into 'rdf'
new_rdf = rdf.merge(ldf_mean, on=['ID','range_key'], how='left')
new_rdf = new_rdf.drop(['range_key'], axis=1)
new_rdf
Output:
ID Top Bottom Value1 Value2 Value3
0 1 2000 2500 39.0 0.2175 396.5
1 1 4300 4500 NaN NaN NaN
2 1 4500 5000 NaN NaN NaN
3 1 7100 7700 NaN NaN NaN
4 2 3200 4100 NaN NaN NaN
5 2 4120 4180 NaN NaN NaN
6 2 4300 5300 NaN NaN NaN
7 2 5500 5520 NaN NaN NaN
8 3 2300 2380 NaN NaN NaN
9 3 3200 4500 NaN NaN NaN
Edit # 1:
The code below seems to work. I added an if-statement to the return from the code posted in the question above; not sure if this is what you were looking to get. It calculates the .sum(). The first interval in rdf is changed to a lower range to match the data in ldf.
# Import libraries
import pandas as pd
# Create DataFrame
rdf = pd.DataFrame({
    'ID': [1,1,1,1,2,2,2,2,3,3],
    'Top': [2000,4300,4500,7100,3200,4120,4300,5500,2300,3200],
    'Bottom':[2500,4500,5000,7700,4100,4180,5300,5520,2380,4500]
})
ldf = pd.DataFrame({
    'ID': [1,1,1,1,1,1,1,1,2,2,3],
    'DEPTH': [2000,2000.5,2001,2001.5,2002,2002.5,2003,2003.5,2000,2000,2000],
    'Value1':[45,43,40,28,40,23,34,54,40,38,43],
    'Value2':[.32,.33,.12,.10,.10,.11,.08,.04,.28,.25,.30],
    'Value3':[423,500,643,20,34,60,900,1002,560,654,343]
})
##### Code from the question (copy-pasted here)
Top = rdf['Top']
Bottom = rdf['Bottom']
Depths = ldf['DEPTH']
def get_depths(x):
    n = ldf[(ldf['DEPTH'] > x['top']) & (ldf['DEPTH'] < x['bottom'])]
    if n.shape[0] > 0:
        return n['ID'].values[0], n['DEPTH'].sum()
test = pd.DataFrame({'top':Top, 'bottom':Bottom})
test[['ID','Value1']] = test.apply(lambda x : get_depths(x),1).apply(pd.Series)
Output:
test
top bottom ID Value1
0 2000 2500 1.0 14014.0
1 4300 4500 NaN NaN
2 4500 5000 NaN NaN
3 7100 7700 NaN NaN
4 3200 4100 NaN NaN
5 4120 4180 NaN NaN
6 4300 5300 NaN NaN
7 5500 5520 NaN NaN
8 2300 2380 NaN NaN
9 3200 4500 NaN NaN
Sample data and imports
import pandas as pd
import numpy as np
import random
# dfr
rdata = {'ID': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3],
'Top': [2010, 4300, 4550, 7100, 3200, 4120, 4300, 5500, 2300, 3200],
'Bottom': [3000, 4500, 5000, 7700, 4100, 4180, 5300, 5520, 2380, 4500]}
dfr = pd.DataFrame(rdata)
# display(dfr.head())
ID Top Bottom
0 1 2010 3000
1 1 4300 4500
2 1 4550 5000
3 1 7100 7700
4 2 3200 4100
# df
np.random.seed(365)
random.seed(365)
rows = 10000
data = {'id': [random.choice([1, 2, 3]) for _ in range(rows)],
'depth': [np.random.randint(2000, 8000) for _ in range(rows)],
'v1': [np.random.randint(40, 50) for _ in range(rows)],
'v2': np.random.rand(rows),
'v3': [np.random.randint(20, 1000) for _ in range(rows)]}
df = pd.DataFrame(data)
df.sort_values(['id', 'depth'], inplace=True)
df.reset_index(drop=True, inplace=True)
# display(df.head())
id depth v1 v2 v3
0 1 2004 48 0.517014 292
1 1 2004 41 0.997347 859
2 1 2006 42 0.278217 851
3 1 2006 49 0.570363 32
4 1 2009 43 0.462985 409
Use each row of dfr to filter and extract stats from df
There are plenty of answers on SO dealing with "TypeError: Invalid comparison between dtype=float64 and str". The numeric columns need to be cleaned of any value that can't be converted to a numeric type.
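For instance, a minimal sketch, assuming the stray strings sit in the depth column:
# Coerce anything non-numeric to NaN so float comparisons cannot raise,
# then drop the rows that failed to parse.
df['depth'] = pd.to_numeric(df['depth'], errors='coerce')
df = df.dropna(subset=['depth'])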
This code deals with using one dataframe to filter and return metrics for another dataframe.
For each row in dfr:
Filter df
Aggregate the mean and count for v1, v2 and v3
.T to transpose the mean and count rows to columns
Convert to a numpy array
Slice the array for the 3 means and append it to v_mean
Slice the array for the max count and append the value to counts
The counts could all be the same, if there are no NaNs in the data
Convert the list of arrays, v_mean, to a dataframe, and join it to dfr_new
Add counts as a column in dfr_new
v_mean = list()
counts = list()
for idx, (i, t, b) in dfr.iterrows():  # iterate through each row of dfr
    # apply the filters and aggregate the stats
    data = (df[['v1', 'v2', 'v3']][(df.id == i) & (df.depth >= t) & (df.depth <= b)]
            .agg(['mean', 'count']).T.to_numpy())
    v_mean.append(data[:, 0])  # get the 3 means
    # get the max of the 3 counts; each column has a count, and the counts
    # could be different if there are NaNs in the data
    counts.append(data[:, 1].max())
# copy dfr to dfr_new
dfr_new = dfr.copy()
# add stats values
dfr_new = dfr_new.join(pd.DataFrame(v_mean, columns=['v1_m', 'v2_m', 'v3_m']))
dfr_new['counts'] = counts
# display(dfr_new)
ID Top Bottom v1_m v2_m v3_m counts
0 1 2010 3000 44.577491 0.496768 502.068266 542.0
1 1 4300 4500 44.555556 0.518066 530.968254 126.0
2 1 4550 5000 44.446281 0.538855 482.818182 242.0
3 1 7100 7700 44.348083 0.489983 506.681416 339.0
4 2 3200 4100 44.804040 0.487011 528.707071 495.0
5 2 4120 4180 45.096774 0.526687 520.967742 31.0
6 2 4300 5300 44.476980 0.529476 523.095764 543.0
7 2 5500 5520 46.000000 0.608876 430.500000 12.0
8 3 2300 2380 44.512195 0.456632 443.195122 41.0
9 3 3200 4500 44.554755 0.516616 501.841499 694.0
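As an aside, the per-row filtering can also be expressed with pd.cut and an IntervalIndex built from each ID's Top/Bottom pairs. A sketch, assuming the intervals within a given ID do not overlap (pd.cut requires non-overlapping bins):
# Bin each depth into its ID's [Top, Bottom] interval, then aggregate per bin.
parts = []
for i, grp in dfr.groupby('ID'):
    bins = pd.IntervalIndex.from_arrays(grp['Top'], grp['Bottom'], closed='both')
    sub = df.loc[df['id'] == i].copy()
    sub['interval'] = pd.cut(sub['depth'], bins)  # NaN for depths outside every interval
    stats = sub.groupby('interval', observed=False)[['v1', 'v2', 'v3']].mean()
    stats['counts'] = sub.groupby('interval', observed=False)['v1'].count()
    stats['ID'] = i
    parts.append(stats)
result = pd.concat(parts).reset_index()  # 'interval' holds the Top/Bottom pairs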
I want to convert the html to csv using pandas functions.
This is a part of what I read into the dataframe df:
0 1
0 sequence 2
1 trainNo K805
2 trainNumber K805
3 departStation 鹰潭
4 departStationPy yingtan
5 arriveStation 南昌
6 arriveStationPy nanchang
7 departDate 2020-05-24
8 departTime 03:55
9 arriveDate 2020-05-24
10 arriveTime 05:44
11 isStartStation False
12 isEndStation False
13 runTime 1小时49分钟
14 preSaleTime NaN
15 takeDays 0
16 isBookable True
17 seatList seatNamepriceorderPriceinventoryisBookablebutt...
18 curSeatIndex 0
seatName price orderPrice inventory isBookable buttonDisplayName buttonType
0 硬座 23.5 23.5 99 True NaN 0
1 硬卧 69.5 69.5 99 True NaN 0
2 软卧 104.5 104.5 4 True NaN 0
0 1
0 departDate 2020-05-23
1 departStationList NaN
2 endStationList NaN
3 departStationFilterMap NaN
4 endStationFilterMap NaN
5 departCityName 上海
6 arriveCityName 南昌
7 gtMinPrice NaN
My code is like this
for i, df in enumerate(pd.read_html(html,encoding='utf-8')):
    df.to_csv(".\other.csv",index=True,encoding='utf-8-sig')
To preserve the characters in the csv, I need to use utf-8-sig encoding. But I don't know how to write the format symbol % to give each table its own filename.
,0,1
0,departDate,2020-05-23
1,departStationList,
2,endStationList,
3,departStationFilterMap,
4,endStationFilterMap,
5,departCityName,上海
6,arriveCityName,南昌
7,gtMinPrice,
This is what I got in the csv file; only the last part is preserved.
The dataframe is correct, while the csv needs correction. Can you show me how to make the correct output?
You're saving each dataframe to the same file, so each one gets overwritten until only the last remains.
Note the addition of the f-string to change the save file name, e.g. f".\other_{i}.csv".
Each dataframe is a different shape, so they won't all fit together properly in a single table.
To CSV
for i, df in enumerate(pd.read_html(html,encoding='utf-8')):
    df.to_csv(f".\other_{i}.csv", index=True, encoding='utf-8-sig')
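If one combined file is still wanted despite the shape differences, appending is an option. A sketch, reusing the html variable from the question:
# Overwrite on the first table, then append the rest to the same file.
for i, df in enumerate(pd.read_html(html, encoding='utf-8')):
    mode = 'w' if i == 0 else 'a'
    df.to_csv(r'.\other.csv', mode=mode, index=True, encoding='utf-8-sig')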
To Excel
with pd.ExcelWriter('output.xlsx', mode='w') as writer:
    for i, df in enumerate(pd.read_html(html,encoding='utf-8')):
        df.to_excel(writer, sheet_name=f'Sheet{i}', encoding='utf-8-sig')
For school I have to make a project about wifi signals, and I am trying to put the data in a dataframe.
There are 208,000 rows of data.
When it comes to the code below, the code does not complete; it acts as if it were stuck in an infinite loop.
When I use only 1000 rows my program works, so I think my lists are too small, if that is possible.
Do bigger lists exist in Python? Or is it because I use bad coding?
Thanks in advance.
Edit 1:
(data is the original dataframe and WifiInfo is a column of it)
I have this format:
df = pd.DataFrame(columns=['Sender','Time','Date','Place','X','Y','Bezetting','SSID','BSSID','Signal'])
And I am trying to fill SSID, BSSID, and Signal from the column WifiInfo; for this I have to split the data.
This is what one WifiInfo value looks like:
ODISEE#88-1d-fc-41-dc-50:-83,ODISEE#88-1d-fc-2c-c0-00:-72,ODISEE#88-1d-fc-41-d2-d0:-82,CiscoC5976#58-6d-8f-19-14-38:-78,CiscoC5959#58-6d-8f-19-13-f4:-93,SNB#c8-d7-19-6f-be-b7:-99,ODISEE#88-1d-fc-2c-c5-70:-94,HackingDemo#58-6d-8f-19-11-48:-156,ODISEE#88-1d-fc-30-d4-40:-85,ODISEE#88-1d-fc-41-ac-50:-100
My current approach looks like:
for index, row in data.iterrows():
    bezettingList = list()
    ssidList = list()
    bssidList = list()
    signalList = list()
    # WifiInfo splitting
    wifis = row.WifiInfo.split(',')
    for wifi in wifis:
        # split wifi and add to lists
        ssid, bssid = wifi.split('#')
        bssid, signal = bssid.split(':')
        ssidList.append(ssid)
        bssidList.append(bssid)
        signalList.append(int(signal))
    # add bezettingen to list
    bezettingen = row.Bezetting.split(',')
    for bezetting in bezettingen:
        bezettingList.append(bezetting)
    # add lists to dataframe
    df.loc[index,'SSID'] = ssidList
    df.loc[index,'BSSID'] = bssidList
    df.loc[index,'Signal'] = signalList
    df.loc[index,'Bezetting'] = bezettingList
df.head()
IIUC, you need to first explode the row by commas so that this:
SSID BSSID Signal WifiInfo
0 NaN NaN NaN ODISEE#88-1d-fc-41-dc-50:-83,ODISEE#88- ...
becomes this:
SSID BSSID Signal WifiInfo
0 NaN NaN NaN ODISEE#88-1d-fc-41-dc-50:-83
1 NaN NaN NaN ODISEE#88-1d-fc-2c-c0-00:-72
2 NaN NaN NaN ODISEE#88-1d-fc-41-d2-d0:-82
3 NaN NaN NaN CiscoC5976#58-6d-8f-19-14-38:-78
4 NaN NaN NaN CiscoC5959#58-6d-8f-19-13-f4:-93
5 NaN NaN NaN SNB#c8-d7-19-6f-be-b7:-99
6 NaN NaN NaN ODISEE#88-1d-fc-2c-c5-70:-94
7 NaN NaN NaN HackingDemo#58-6d-8f-19-11-48:-156
8 NaN NaN NaN ODISEE#88-1d-fc-30-d4-40:-85
9 NaN NaN NaN ODISEE#88-1d-fc-41-ac-50:-100
# use `.explode`
data = data.assign(WifiInfo=data.WifiInfo.str.split(',')).explode('WifiInfo')
Now you could use .str.extract:
data['SSID'] = data['WifiInfo'].str.extract(r'(.*)#')
data['BSSID'] = data['WifiInfo'].str.extract(r'#(.*):')
data['Signal'] = data['WifiInfo'].str.extract(r':(.*)')
SSID BSSID Signal WifiInfo
0 ODISEE 88-1d-fc-41-dc-50 -83 ODISEE#88-1d-fc-41-dc-50:-83
1 ODISEE 88-1d-fc-2c-c0-00 -72 ODISEE#88-1d-fc-2c-c0-00:-72
2 ODISEE 88-1d-fc-41-d2-d0 -82 ODISEE#88-1d-fc-41-d2-d0:-82
3 CiscoC5976 58-6d-8f-19-14-38 -78 CiscoC5976#58-6d-8f-19-14-38:-78
4 CiscoC5959 58-6d-8f-19-13-f4 -93 CiscoC5959#58-6d-8f-19-13-f4:-93
5 SNB c8-d7-19-6f-be-b7 -99 SNB#c8-d7-19-6f-be-b7:-99
6 ODISEE 88-1d-fc-2c-c5-70 -94 ODISEE#88-1d-fc-2c-c5-70:-94
7 HackingDemo 58-6d-8f-19-11-48 -156 HackingDemo#58-6d-8f-19-11-48:-156
8 ODISEE 88-1d-fc-30-d4-40 -85 ODISEE#88-1d-fc-30-d4-40:-85
9 ODISEE 88-1d-fc-41-ac-50 -100 ODISEE#88-1d-fc-41-ac-50:-100
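As a side note, the three extracts can be collapsed into a single pass with named capture groups. A sketch:
# Named groups become the column names; one regex pass instead of three.
data[['SSID', 'BSSID', 'Signal']] = data['WifiInfo'].str.extract(
    r'(?P<SSID>.*)#(?P<BSSID>.*):(?P<Signal>.*)')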
If you want to keep data grouped after column explosion, I'd assign an ID for each group of entries first:
data['Group'] = pd.factorize(data['WifiInfo'])[0]+1
SSID BSSID Signal WifiInfo Group
0 NaN NaN NaN ODISEE#88-1d-fc-41-dc-50:-83,ODISEE#88- ... 1
1 NaN NaN NaN ASD#22-1d-fc-41-dc-50:-83,QWERTY#88- ... 2
# after you explode the column
SSID BSSID Signal WifiInfo Group
ODISEE 88-1d-fc-41-dc-50 -83 ODISEE#88-1d-fc-41-dc-50:-83 1
ODISEE 88-1d-fc-2c-c0-00 -72 ODISEE#88-1d-fc-2c-c0-00:-72 1
...
...
ASD 22-1d-fc-41-dc-50 -83 ASD#22-1d-fc-41-dc-50:-83 2
QWERTY 88-1d-fc-2c-c0-00 -72 QWERTY#88-1d-fc-2c-c0-00:-72 2