Pandas: clean & convert DataFrame to numbers

I have a dataframe containing strings, as read from a sloppy csv:
id Total B C ...
0 56 974 20 739 34 482
1 29 479 10 253 16 704
2 86 961 29 837 43 593
3 52 687 22 921 28 299
4 23 794 7 646 15 600
What I want to do: convert every cell in the frame into a number. The conversion should ignore the whitespace used as a thousands separator, but put NaN wherever a cell contains something genuinely unparsable.
I could probably do it with terribly slow manual looping and value replacement, but I was wondering if there is a nice and clean way to do this.

You can use read_csv with the regex separator \s{2,} (two or more whitespace characters) and the thousands parameter:
import pandas as pd
from io import StringIO
temp=u"""id Total B C
0 56 974 20 739 34 482
1 29 479 10 253 16 704
2 86 961 29 837 43 593
3 52 687 22 921 28 299
4 23 794 7 646 15 600 """
#after testing, replace 'StringIO(temp)' with 'filename.csv'
df = pd.read_csv(StringIO(temp), sep=r"\s{2,}", engine='python', thousands=' ')
print (df)
id Total B C
0 0 56974 20739 34482
1 1 29479 10253 16704
2 2 86961 29837 43593
3 3 52687 22921 28299
4 4 23794 7646 15600
print (df.dtypes)
id int64
Total int64
B int64
C int64
dtype: object
And then, if necessary, apply the to_numeric function with the parameter errors='coerce' - it replaces non-numeric values with NaN:
df = df.apply(pd.to_numeric, errors='coerce')
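If the data has already been read in as plain strings (for example because the file was loaded without the thousands parameter), a minimal sketch of the same cleaning, assuming df holds those string columns, is to strip the separator spaces per column and then coerce:
import pandas as pd

# Assumption: df already contains strings such as "56 974".
# Remove the spaces used as thousands separators, then convert;
# errors='coerce' turns anything unparsable into NaN.
df_clean = df.apply(lambda col: pd.to_numeric(
    col.astype(str).str.replace(' ', '', regex=False), errors='coerce'))
print(df_clean.dtypes)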

How to use groupby with nan value in groupby column

I have the following DataFrame (image: "Original Dataframe") and I want the output shown in the second image ("Output Dataframe").
I have tried using groupby on the "Container" column (summing the other columns), but it only gives the first row as output.
I am very new to Python and pandas and not sure if I am doing it correctly.
Some of the answers to similar questions are too advanced for me to understand.
I am just wondering if I can get the output with just 2-3 lines of code.
To get exactly the result you showed as "Output Dataframe", the "NaN" values in the "Container" column of your original DataFrame must first be replaced with the value immediately above them. I added more "NaN" values to illustrate:
Original DataFrame:
df
Container SB No Pkgs CBM Weight
257 CXRU1219452 195375 1650 65 23000
259 BEAU4883430 140801 26 3 575
260 NaN 140868 60 8 1153
261 NaN 140824 11 1 197
262 NaN 140851 253 32 4793
263 NaN 140645 14 1 278
264 NaN 140723 5 0 71
265 NaN 140741 1 0 22
266 NaN 140768 5 0 93
268 SZLU9366565 189355 1800 65 23000
259 ZBCD1234567 100000 100 10 1000
260 NaN 100000 100 10 1000
261 NaN 100000 100 10 1000
262 NaN 100000 100 10 1000
Use "fillna" function with method "ffill" as suggested by [https://stackoverflow.com/a/27905350/6057650][1]
Then you will get "Container" column without "NaN" values:
df=df.fillna(method='ffill')
df
Container SB No Pkgs CBM Weight
257 CXRU1219452 195375 1650 65 23000
259 BEAU4883430 140801 26 3 575
260 BEAU4883430 140868 60 8 1153
261 BEAU4883430 140824 11 1 197
262 BEAU4883430 140851 253 32 4793
263 BEAU4883430 140645 14 1 278
264 BEAU4883430 140723 5 0 71
265 BEAU4883430 140741 1 0 22
266 BEAU4883430 140768 5 0 93
268 SZLU9366565 189355 1800 65 23000
259 ZBCD1234567 100000 100 10 1000
260 ZBCD1234567 100000 100 10 1000
261 ZBCD1234567 100000 100 10 1000
262 ZBCD1234567 100000 100 10 1000
Now you can get the expected "Output DataFrame" using groupby:
df.groupby(['Container']).sum()
SB No Pkgs CBM Weight
Container
BEAU4883430 1126221 375 45 7182
CXRU1219452 195375 1650 65 23000
SZLU9366565 189355 1800 65 23000
ZBCD1234567 400000 400 40 4000
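Since you asked for only two or three lines, the same ffill-plus-groupby idea can be condensed; this is just a sketch of the steps above in one go (df.ffill() is equivalent to fillna(method='ffill')):
# Forward-fill the missing container codes, then aggregate - same idea as above.
result = df.ffill().groupby('Container').sum()
print(result)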
I believe you could groupby and sum like below. The dropna will drop the NaN/empty values in your DataFrame.
df.dropna().groupby(['Container']).sum()
import pandas as pd

d = [['CXRU', 195, 1650, 65, 23000],
     ['BEAU', 140, 26, 3, 575],
     ['NaN', 140, 60, 8, 1153]]
df = pd.DataFrame(d, columns=['Container', 'SB No', 'Pkgs', 'CBM', 'Weight'])
df

# keep only rows whose Container is not the literal string 'NaN'
sel = df['Container'] != 'NaN'
df[sel]
import pandas as pd
df = pd.DataFrame({'id':['aaa', 'aaa', 'bbb', 'ccc', 'bbb', 'NaN', 'NaN', 'aaa', 'NaN'],
'values':[1,2,3,4,5,6,7,8,9]})
df
for i in range(len(df)):
    if df.iloc[i, 0] == "NaN":
        df.iloc[i, 0] = df.iloc[i-1, 0]
df.groupby('id').sum()
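The same replacement can usually be done without the explicit loop, since the 'NaN' entries here are literal strings; a small sketch converting them to real missing values and forward-filling:
import numpy as np

# Turn the string "NaN" into a real missing value, forward-fill from the row
# above, then aggregate - this mirrors what the loop above does.
df['id'] = df['id'].replace('NaN', np.nan).ffill()
print(df.groupby('id').sum())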

Read 4 lines of data into one row of pandas data frame

I have a txt file with values like these:
108,612,620,900
168,960,680,1248
312,264,768,564
516,1332,888,1596
I need to read all of this into a single row of a data frame:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 108 612 620 900 168 960 680 1248 312 264 768 564 516 1332 888 1596
I have many such files and so I'll keep appending rows to this data frame.
I believe we need some kind of regex but I'm not able to figure it out. For now this is what I have :
df = pd.read_csv(f,sep=",| ", header = None)
But this takes , and the space as separators, whereas I want it to take the newline as a separator.
First, read the data:
df = pd.read_csv('test/t.txt', header=None)
It gives you a DataFrame shaped like the CSV. Then concatenate:
s = pd.concat((df.loc[i] for i in df.index), ignore_index=True)
It gives you a Series:
0 108
1 612
2 620
3 900
4 168
5 960
6 680
7 1248
8 312
9 264
10 768
11 564
12 516
13 1332
14 888
15 1596
dtype: int64
Finally, if you really want a horizontal DataFrame:
pd.DataFrame([s])
Gives you:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 108 612 620 900 168 960 680 1248 312 264 768 564 516 1332 888 1596
Since you've mentioned in a comment that you have many such files, you should simply store all the Series in a list, and construct a DataFrame with all of them at once when you're finished loading them all.
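A rough sketch of that loop, assuming the file names are collected in a hypothetical list called filenames:
import pandas as pd

filenames = ['test/t.txt']  # hypothetical list of the txt files to load

rows = []
for name in filenames:
    df = pd.read_csv(name, header=None)           # one DataFrame per file
    s = pd.concat((df.loc[i] for i in df.index),  # flatten it into one Series
                  ignore_index=True)
    rows.append(s)

# Build the final frame in one go; each file becomes one row.
result = pd.DataFrame(rows).reset_index(drop=True)
print(result)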

Python pandas idxmax for multiple indexes in a dataframe

I have a series that looks like this:
delivery
2007-04-26 706 23
2007-04-27 705 10
706 1089
708 83
710 13
712 51
802 4
806 1
812 3
2007-04-29 706 39
708 4
712 1
2007-04-30 705 3
706 1016
707 2
...
2014-11-04 1412 53
1501 1
1502 1
1512 1
2014-11-05 1411 47
1412 1334
1501 40
1502 433
1504 126
1506 100
1508 7
1510 6
1512 51
1604 1
1612 5
Length: 26255, dtype: int64
where the query is: df.groupby([df.index.date, 'delivery']).size()
For each day, I need to pull out the delivery number which has the most volume. I feel like it would be something like:
df.groupby([df.index.date, 'delivery']).size().idxmax(axis=1)
However, this just returns the idxmax for the entire DataFrame; instead, I need the second-level idxmax (not the date but rather the delivery number) for each day, so that it returns a vector rather than a single value.
Any ideas on how to accomplish this?
Your example code doesn't work because idxmax is executed after the groupby operation, so it runs on the whole result rather than per group.
I'm not sure how to use idxmax on multilevel indexes, so here's a simple workaround.
Setting up the data:
import pandas as pd
d= {'Date': ['2007-04-26', '2007-04-27', '2007-04-27', '2007-04-27',
'2007-04-27', '2007-04-28', '2007-04-28'],
'DeliveryNb': [706, 705, 708, 450, 283, 45, 89],
'DeliveryCount': [23, 10, 1089, 82, 34, 100, 11]}
df = pd.DataFrame.from_dict(d, orient='columns').set_index('Date')
print(df)
output
DeliveryCount DeliveryNb
Date
2007-04-26 23 706
2007-04-27 10 705
2007-04-27 1089 708
2007-04-27 82 450
2007-04-27 34 283
2007-04-28 100 45
2007-04-28 11 89
Creating a custom function:
The trick is to use the reset_index() method (so you easily get the integer index within the group):
def func(df):
    idx = df.reset_index()['DeliveryCount'].idxmax()
    return df['DeliveryNb'].iloc[idx]
Applying it:
g = df.groupby(df.index)
g.apply(func)
Result:
Date
2007-04-26 706
2007-04-27 708
2007-04-28 45
dtype: int64
Suppose you have this series:
delivery
2001-01-02 0 2
1 3
6 2
7 2
9 3
2001-01-03 3 2
6 1
7 1
8 3
9 1
dtype: int64
If you want one delivery per date with the maximum value, you could use idxmax:
dates = series.index.get_level_values(0)
series.loc[series.groupby(dates).idxmax()]
yields
delivery
2001-01-02 1 3
2001-01-03 8 3
dtype: int64
If you want all deliveries per date with the maximum value, use transform to generate a boolean mask:
mask = series.groupby(dates).transform(lambda x: x==x.max()).astype('bool')
series.loc[mask]
yields
delivery
2001-01-02 1 3
9 3
2001-01-03 8 3
dtype: int64
This is the code I used to generate series:
import pandas as pd
import numpy as np
np.random.seed(1)
N = 20
rng = pd.date_range('2001-01-02', periods=N//2, freq='4H')
rng = np.random.choice(rng, N, replace=True)
rng.sort()
df = pd.DataFrame(np.random.randint(10, size=(N,)), columns=['delivery'], index=rng)
series = df.groupby([df.index.date, 'delivery']).size()
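If you only need the winning delivery number per date (rather than the matching rows of the series), the (date, delivery) index tuples that idxmax returns can be unpacked; a small sketch based on the same series and the dates variable defined above:
# idxmax per date gives (date, delivery) index tuples; keep only the delivery part.
best_delivery = series.groupby(dates).idxmax().apply(lambda t: t[1])
print(best_delivery)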
If you have the following DataFrame (you can always reset the index if needed with df = df.reset_index()):
Date Del_Count Del_Nb
0 1/1 14 19 <
1 11 17
2 2/2 25 29 <
3 21 27
4 22 28
5 3/3 34 36
6 37 37
7 31 39 <
To find the max per Date and extract the relevant Del_Count you can use:
df = df.loc[df.groupby(['Date'], sort=False)['Del_Nb'].idxmax()][['Date','Del_Count','Del_Nb']]
Which would yield:
Date Del_Count Del_Nb
0 1/1 14 19
2 2/2 25 29
7 3/3 31 39

Issue with reindexing a multiindex

I am struggling to reindex a multiindex. Example code below:
rng = pd.date_range('01/01/2000 00:00', '31/12/2004 23:00', freq='H')
ts = pd.Series([h.dayofyear for h in rng], index=rng)
daygrouped = ts.groupby(lambda x: x.dayofyear)
daymean = daygrouped.mean()
myindex = np.arange(1,367)
myindex = np.concatenate((myindex[183:],myindex[:183]))
daymean.reindex(myindex)
gives (as expected):
184 184
185 185
186 186
187 187
...
180 180
181 181
182 182
183 183
Length: 366, dtype: int64
BUT if I create a multiindex:
hourgrouped = ts.groupby([lambda x: x.dayofyear, lambda x: x.hour])
hourmean = hourgrouped.mean()
myindex = np.arange(1,367)
myindex = np.concatenate((myindex[183:],myindex[:183]))
hourmean.reindex(myindex, level=1)
I get:
1 1 1
2 1
3 1
4 1
...
366 20 366
21 366
22 366
23 366
Length: 8418, dtype: int64
Any ideas on my mistake? - Thanks.
Bevan
First, you have to specify level=0 instead of 1 (as it is the first level -> zero-based indexing -> 0).
But, there is still a problem: the reindexing works, but does not seem to preserve the order of the provided index in the case of a MultiIndex:
In [54]: hourmean.reindex([5,4], level=0)
Out[54]:
4 0 4
1 4
2 4
3 4
4 4
...
20 4
21 4
22 4
23 4
5 0 5
1 5
2 5
3 5
4 5
...
20 5
21 5
22 5
23 5
dtype: int64
So getting a new subset of the index works, but it is in the same order as the original and not as the new provided index.
This is possibly a bug with reindex on a certain level (I opened an issue to discuss this: https://github.com/pydata/pandas/issues/8241)
A solution for now is to create the full MultiIndex and reindex with that (so not on a specified level, but with the full index, which does preserve the order). This is very easy with MultiIndex.from_product, as you already have myindex:
In [79]: myindex2 = pd.MultiIndex.from_product([myindex, range(24)])
In [82]: hourmean.reindex(myindex2)
Out[82]:
184 0 184
1 184
2 184
3 184
4 184
5 184
6 184
7 184
8 184
9 184
10 184
11 184
12 184
13 184
14 184
...
183 9 183
10 183
11 183
12 183
13 183
14 183
15 183
16 183
17 183
18 183
19 183
20 183
21 183
22 183
23 183
Length: 8784, dtype: int64
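Depending on the pandas version, selecting the first level directly with .loc may also work, since passing a list of labels should preserve the order of that list; a hedged sketch:
# Select the level-0 labels (day numbers) in the desired rotated order, day 184 first.
# .loc with a list of first-level labels should keep the given order.
reordered = hourmean.loc[list(myindex)]
print(reordered)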

Match rows in one Pandas dataframe to another based on three columns

I have two Pandas dataframes, one quite large (30000+ rows) and one a lot smaller (100+ rows).
The dfA looks something like:
X Y ONSET_TIME COLOUR
0 104 78 1083 6
1 172 78 1083 16
2 240 78 1083 15
3 308 78 1083 8
4 376 78 1083 8
5 444 78 1083 14
6 512 78 1083 14
... ... ... ... ...
The dfB looks something like:
TIME X Y
0 7 512 350
1 1722 512 214
2 1906 376 214
3 2095 376 146
4 2234 308 78
5 2406 172 146
... ... ... ...
What I want to do: for every row in dfB, find the row in dfA where the values of the X AND Y columns are equal AND which is the first such row where dfB['TIME'] is greater than dfA['ONSET_TIME'], and return the value of dfA['COLOUR'] for that row.
dfA represents refreshes of a display, where X and Y are coordinates of items on the display and so repeat themselves for every different ONSET_TIME (there are 108 pairs of coordinates for each value of ONSET_TIME).
There will be multiple rows where the X and Y in the two dataframes are equal, but I need the one that matches the time too.
I have done this using for loops and if statements just to see that it could be done, but obviously given the size of the dataframes this takes a very long time.
for s in range(0, len(dfA)):
    for r in range(0, len(dfB)):
        if (dfB.iloc[r,1] == dfA.iloc[s,0]) and (dfB.iloc[r,2] == dfA.iloc[s,1]) and (dfA.iloc[s,2] <= dfB.iloc[r,0] < dfA.iloc[s+108,2]):
            return dfA.iloc[s,3]
There is probably an even more efficient way to do this, but here is a method without those slow for loops:
import pandas as pd
dfB = pd.DataFrame({'X':[1,2,3],'Y':[1,2,3], 'Time':[10,20,30]})
dfA = pd.DataFrame({'X':[1,1,2,2,2,3],'Y':[1,1,2,2,2,3], 'ONSET_TIME':[5,7,9,16,22,28],'COLOR': ['Red','Blue','Blue','red','Green','Orange']})
#create one single table
mergeDf = pd.merge(dfA, dfB, left_on = ['X','Y'], right_on = ['X','Y'])
#remove rows where time is less than onset time
filteredDf = mergeDf[mergeDf['ONSET_TIME'] < mergeDf['Time']]
#take the max ONSET_TIME in each (X, Y) group (the one closest to Time from below)
groupedDf = filteredDf.groupby(['X','Y']).max()
print(filteredDf)
COLOR ONSET_TIME X Y Time
0 Red 5 1 1 10
1 Blue 7 1 1 10
2 Blue 9 2 2 20
3 red 16 2 2 20
5 Orange 28 3 3 30
print(groupedDf)
COLOR ONSET_TIME Time
X Y
1 1 Red 7 10
2 2 red 16 20
3 3 Orange 28 30
The basic idea is to merge the two tables so you have the times together in one table. Then I filtered on the records whose ONSET_TIME is the largest (closest to the TIME in your dfB). Let me know if you have questions about this.
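Note that .max() aggregates every column independently, so in groupedDf the COLOR value is not guaranteed to come from the same row as the largest ONSET_TIME. If whole rows are needed, one possible sketch using idxmax on the filtered frame from above:
# For every (X, Y) pair, keep the complete row whose ONSET_TIME is the largest
# one still below Time, so COLOR and ONSET_TIME stay paired correctly.
idx = filteredDf.groupby(['X', 'Y'])['ONSET_TIME'].idxmax()
closest = filteredDf.loc[idx]
print(closest)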
Use merge() - it works like JOIN in SQL - and the first part is done.
d1 = ''' X Y ONSET_TIME COLOUR
104 78 1083 6
172 78 1083 16
240 78 1083 15
308 78 1083 8
376 78 1083 8
444 78 1083 14
512 78 1083 14
308 78 3000 14
308 78 2000 14'''
d2 = ''' TIME X Y
7 512 350
1722 512 214
1906 376 214
2095 376 146
2234 308 78
2406 172 146'''
import pandas as pd
from io import StringIO

dfA = pd.read_csv(StringIO(d1), sep=r'\s+')
#print(dfA)
dfB = pd.read_csv(StringIO(d2), sep=r'\s+')
#print(dfB)
df1 = pd.merge(dfA, dfB, on=['X','Y'])
print(df1)
result:
X Y ONSET_TIME COLOUR TIME
0 308 78 1083 8 2234
1 308 78 3000 14 2234
2 308 78 2000 14 2234
Then you can use it to filter results.
df2 = df1[ df1['ONSET_TIME'] < df1['TIME'] ]
print(df2)
result:
X Y ONSET_TIME COLOUR TIME
0 308 78 1083 8 2234
2 308 78 2000 14 2234
