How to label identical pandas dataframe rows? - python

I have a large pandas dataframe like this:
log apple watermelon orange lemon grapes
1 1 1 yes 0 0
1 2 0 1 0 0
1 True 0 0 0 2
2 0 0 0 0 2
2 1 1 yes 0 0
2 0 0 0 0 2
2 0 0 0 0 2
3 True 0 0 0 2
4 0 0 0 0 2.1
4 0 0 0 0 2.1
How can I label the rows that are the same, for example:
log apple watermelon orange lemon grapes ID
1 1 1 yes 0 0 1
1 2 0 1 0 0 2
1 True 0 0 0 2 3
2 0 0 0 0 2 4
2 1 1 yes 0 0 1
2 0 0 0 0 2 4
2 0 0 0 0 2 4
3 True 0 0 0 2 3
4 0 0 0 0 2.1 5
4 0 0 0 0 2.1 5
I tried to:
df['ID']=df.groupby('log')[df.columns].transform('ID')
And
df['personid'] = df['log'].clip_upper(2) - 2*d.duplicated(subset='apple')
df
However, the above doesn't work because I have a lot of columns, and it's not giving me the expected output. Any idea of how to group and label this dataframe?

Given
import io
import pandas as pd

x = io.StringIO("""log apple watermelon orange lemon grapes
1 1 1 yes 0 0
1 2 0 1 0 0
1 True 0 0 0 2
2 0 0 0 0 2
2 1 1 yes 0 0
2 0 0 0 0 2
2 0 0 0 0 2
3 True 0 0 0 2
4 0 0 0 0 2.1
4 0 0 0 0 2.1""")
df2 = pd.read_table(x, delim_whitespace=True)
You can first use transform with tuple to make each row hashable and comparable, and then play with indexes and range to create unique IDs:
f = df2.transform(tuple, axis=1).to_frame()  # one tuple per row, in column 0
k = f.groupby(0).sum()                       # index of unique row tuples
k['id'] = range(1, len(k.index) + 1)         # sequential id per unique tuple
And finally:
df2['temp_key'] = f[0]
df2 = df2.set_index('temp_key')
df2['id'] = k.id  # aligns on the tuple index
df2.reset_index().drop('temp_key', axis=1)
log apple watermelon orange lemon grapes id
0 1 1 1 yes 0 0.0 1
1 1 2 0 1 0 0.0 2
2 1 True 0 0 0 2.0 3
3 2 0 0 0 0 2.0 4
4 2 1 1 yes 0 0.0 5
5 2 0 0 0 0 2.0 4
6 2 0 0 0 0 2.0 4
7 3 True 0 0 0 2.0 6
8 4 0 0 0 0 2.1 7
9 4 0 0 0 0 2.1 7
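A more direct route (a sketch of my own, not part of the original answer) is GroupBy.ngroup, which assigns a sequential group number per unique key. Applied to the freshly parsed df2, grouping by every column gives one id per distinct row; and if, as the question's expected output suggests, rows should match regardless of the log value, group by the remaining columns instead:
# number each distinct row, in order of first appearance
df2['id'] = df2.groupby(list(df2.columns), sort=False).ngroup() + 1
# ignore 'log' (and the freshly added 'id') when comparing rows
key_cols = [c for c in df2.columns if c not in ('log', 'id')]
df2['ID'] = df2.groupby(key_cols, sort=False).ngroup() + 1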

Related

Is there any way to convert the columns in a Pandas DataFrame using its mirror image DataFrame structure

The df I have is:
0 1 2
0 0 0 0
1 0 0 1
2 0 1 0
3 0 1 1
4 1 0 0
5 1 0 1
6 1 1 0
7 1 1 1
I want to obtain a DataFrame with the columns reversed (a mirror image):
0 1 2
0 0 0 0
1 1 0 0
2 0 1 0
3 1 1 0
4 0 0 1
5 1 0 1
6 0 1 1
7 1 1 1
Is there any way to do that?
You can check:
df[:] = df.iloc[:,::-1]
df
Out[959]:
0 1 2
0 0 0 0
1 1 0 0
2 0 1 0
3 1 1 0
4 0 0 1
5 1 0 1
6 0 1 1
7 1 1 1
Here is a slightly more verbose, but likely more efficient, solution, as it doesn't rewrite the data; it only renames and reorders the columns:
cols = df.columns
df.columns = df.columns[::-1]
df = df.loc[:,cols]
Or a shorter variant:
df = df.iloc[:,::-1].set_axis(df.columns, axis=1)
Output:
0 1 2
0 0 0 0
1 1 0 0
2 0 1 0
3 1 1 0
4 0 0 1
5 1 0 1
6 0 1 1
7 1 1 1
There are other ways, but here's one solution:
df[df.columns] = df[reversed(df.columns)]
Output:
0 1 2
0 0 0 0
1 1 0 0
2 0 1 0
3 1 1 0
4 0 0 1
5 1 0 1
6 0 1 1
7 1 1 1
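For reference, a minimal self-contained sketch (my reconstruction of the demo frame via itertools.product, not from the original posts) that checks the reversal:
import itertools
import pandas as pd

rows = list(itertools.product([0, 1], repeat=3))  # 000, 001, ..., 111
df = pd.DataFrame(rows)
# reversed values relabelled with the original column names
mirrored = df.iloc[:, ::-1].set_axis(df.columns, axis=1)
assert mirrored.equals(pd.DataFrame([r[::-1] for r in rows]))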

Count how many cells are between the last value in the dataframe and the end of the row

I'm using the pandas library in Python.
I have a data frame:
0 1 2 3 4
0 0 0 0 1 0
1 0 0 0 0 1
2 0 0 1 0 0
3 1 0 0 0 0
4 0 0 1 0 0
5 0 1 0 0 0
6 1 0 0 1 1
Is it possible to create a new column that is a count of the number of cells that are empty between the end of the row and the last value above zero? Example data frame below:
0 1 2 3 4 Value
0 0 0 0 1 0 1
1 0 0 0 0 1 0
2 0 0 1 0 0 2
3 1 0 0 0 0 4
4 0 0 1 0 0 2
5 0 1 0 0 0 3
6 1 0 0 1 1 0
Using argmax:
df['Value'] = df.apply(lambda x: (x.iloc[::-1] == 1).argmax(), axis=1)
Or using np.where:
df['Value'] = np.where(df.iloc[:, ::-1] == 1, True, False).argmax(1)
0 1 2 3 4 Value
0 0 0 0 1 0 1
1 0 0 0 0 1 0
2 0 0 1 0 0 2
3 1 0 0 0 0 4
4 0 0 1 0 0 2
5 0 1 0 0 0 3
6 1 0 0 1 1 0
Use:
df['new'] = df.iloc[:, ::-1].cumsum(axis=1).eq(0).sum(axis=1)
print (df)
0 1 2 3 4 new
0 0 0 0 1 0 1
1 0 0 0 0 1 0
2 0 0 1 0 0 2
3 1 0 0 0 0 4
4 0 0 1 0 0 2
5 0 1 0 0 0 3
6 1 0 0 1 1 0
Details:
First reverse the order of the columns with DataFrame.iloc and slicing:
print (df.iloc[:, ::-1])
4 3 2 1 0
0 0 1 0 0 0
1 1 0 0 0 0
2 0 0 1 0 0
3 0 0 0 0 1
4 0 0 1 0 0
5 0 0 0 1 0
6 1 1 0 0 1
Then take the cumulative sum per row with DataFrame.cumsum:
print (df.iloc[:, ::-1].cumsum(axis=1))
4 3 2 1 0
0 0 1 1 1 1
1 1 1 1 1 1
2 0 0 1 1 1
3 0 0 0 0 1
4 0 0 1 1 1
5 0 0 0 1 1
6 1 2 2 2 3
Then flag the positions that are still 0 with DataFrame.eq:
print (df.iloc[:, ::-1].cumsum(axis=1).eq(0))
4 3 2 1 0
0 True False False False False
1 False False False False False
2 True True False False False
3 True True True True False
4 True True False False False
5 True True True False False
6 False False False False False
And finally count them per row with sum:
print (df.iloc[:, ::-1].cumsum(axis=1).eq(0).sum(axis=1))
0 1
1 0
2 2
3 4
4 2
5 3
6 0
dtype: int64
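One edge case worth noting (my observation, not from the answers above): the two families of solutions disagree on a row that contains no 1 at all. argmax over an all-False mask returns 0, while the cumsum approach counts every cell:
row = pd.DataFrame([[0, 0, 0, 0, 0]])
print (row.apply(lambda x: (x.iloc[::-1] == 1).argmax(), axis=1))  # 0    0
print (row.iloc[:, ::-1].cumsum(axis=1).eq(0).sum(axis=1))         # 0    5
Pick whichever convention matches how empty rows should be counted.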

Reindex Pandas MultiIndex

I am trying to create a new index for a dataframe created from a ROOT file. I'm using uproot to bring in the file with the command:
upfile_muon = uproot.open(file_prefix_muon + '.root')
tree_muon = upfile_muon['ntupler']['tree']
df_muon = tree_muon.pandas.df(['vh_sim_r', 'vh_sim_phi', 'vh_sim_z', 'vh_sim_tp1', 'vh_sim_tp2',
                               'vh_type', 'vh_station', 'vh_ring', 'vh_sim_theta'],
                              entrystop=args.max_events)
This creates a MultiIndex pandas dataframe with entry and subentry as its two index levels. I want to filter out all entries with 3 or fewer subentries. I do that with the following loop, while also building vectors that slice the dataframe down to the data I need.
a = 0
bad_entries = 0
entries = []
nuindex = []
tru = 0
while a < args.max_events:
    if df_muon.loc[(a), :].shape[0] > 3:
        entries.append(a)
        b = 0
        while b < df_muon.loc[(a), :].shape[0]:
            nuindex.append(tru)
            b = b + 1
        tru = tru + 1
    else:
        bad_entries = bad_entries + 1
    a = a + 1
df_muon = df_muon.loc[pd.IndexSlice[entries, :], :]
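For reference, the same filter can likely be expressed without the explicit loop, using groupby.filter on the first index level (a sketch, assuming the level is named 'entry' as in the display below):
# keep only entries with more than 3 subentries
df_muon = df_muon.groupby(level='entry').filter(lambda g: len(g) > 3)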
So now my dataframe looks like this
vh_sim_r vh_sim_phi vh_sim_z vh_sim_tp1 vh_sim_tp2 vh_type vh_station vh_ring vh_sim_theta
entry subentry
0 0 149.724701 -124.728081 793.598755 0 0 3 2 1 10.684152
1 149.236725 -124.180763 796.001221 -1 -1 3 2 1 10.618716
2 149.456131 -124.687302 796.001221 0 0 3 2 1 10.633972
3 92.405533 -126.913628 539.349976 0 0 4 1 1 9.721958
4 149.345184 -124.332527 839.810669 0 0 1 2 1 10.083608
5 176.544983 -123.978333 964.500000 0 0 2 3 1 10.372764
6 194.614502 -123.764595 1054.994995 0 0 2 4 1 10.451831
7 149.236725 -124.180763 796.001221 -1 -1 3 2 1 10.618716
8 149.456131 -124.687302 796.001221 0 0 3 2 1 10.633972
9 92.405533 -126.913628 539.349976 0 0 4 1 1 9.721958
10 149.345184 -124.332527 839.810669 0 0 1 2 1 10.083608
11 176.544983 -123.978333 964.500000 0 0 2 3 1 10.372764
12 194.614502 -123.764595 1054.994995 0 0 2 4 1 10.451831
1 0 265.027252 -3.324370 796.001221 0 0 3 2 1 18.415092
1 272.908997 -3.531896 839.903625 0 0 1 2 1 18.000479
2 299.305176 -3.531351 923.885132 0 0 1 3 1 17.950438
3 312.799255 -3.499015 964.500000 0 0 2 3 1 17.968519
4 328.321442 -3.530087 1013.620056 0 0 1 4 1 17.947645
5 181.831726 -1.668625 567.971252 0 0 3 1 1 17.752077
6 265.027252 -3.324370 796.001221 0 0 3 2 1 18.415092
7 197.739120 -2.073746 615.796265 0 0 1 1 1 17.802410
8 272.908997 -3.531896 839.903625 0 0 1 2 1 18.000479
9 299.305176 -3.531351 923.885132 0 0 1 3 1 17.950438
10 312.799255 -3.499015 964.500000 0 0 2 3 1 17.968519
11 328.321442 -3.530087 1013.620056 0 0 1 4 1 17.947645
12 356.493073 -3.441958 1065.694946 0 0 2 4 2 18.495964
2 0 204.523163 -124.065643 839.835571 0 0 1 2 1 13.686690
1 135.439163 -122.568153 567.971252 0 0 3 1 1 13.412345
2 196.380875 -123.940300 796.001221 0 0 3 2 1 13.858652
3 129.801193 -122.348656 539.349976 0 0 4 1 1 13.531607
4 224.134796 -124.194283 923.877441 0 0 1 3 1 13.636631
5 237.166031 -124.181770 964.500000 0 0 2 3 1 13.814683
6 246.809235 -124.196938 1013.871643 0 0 1 4 1 13.681540
7 259.389587 -124.164017 1054.994995 0 0 2 4 1 13.813211
8 204.523163 -124.065643 839.835571 0 0 1 2 1 13.686690
9 196.380875 -123.940300 796.001221 0 0 3 2 1 13.858652
10 129.801193 -122.348656 539.349976 0 0 4 1 1 13.531607
11 224.134796 -124.194283 923.877441 0 0 1 3 1 13.636631
12 237.166031 -124.181770 964.500000 0 0 2 3 1 13.814683
13 246.809235 -124.196938 1013.871643 0 0 1 4 1 13.681540
14 259.389587 -124.164017 1054.994995 0 0 2 4 1 13.813211
3 0 120.722900 -22.053474 615.786621 0 0 1 1 4 11.091969
1 170.635376 -23.190208 793.598755 0 0 3 2 1 12.134683
2 110.061127 -21.370941 539.349976 0 0 4 1 1 11.533570
3 164.784668 -23.263920 814.977478 0 0 1 2 1 11.430829
4 192.868652 -23.398684 948.691345 0 0 1 3 1 11.491603
5 199.817978 -23.325649 968.900024 0 0 2 3 1 11.652840
6 211.474625 -23.265354 1038.803833 0 0 1 4 1 11.506759
7 216.406830 -23.275047 1059.395020 0 0 2 4 1 11.545199
8 170.612457 -23.136520 793.598755 -1 -1 3 2 1 12.133101
5 0 179.913177 -14.877813 615.749207 0 0 1 1 1 16.287615
1 160.188034 -14.731569 565.368774 0 0 3 1 1 15.819215
2 240.671204 -15.410946 793.598755 0 0 3 2 1 16.870745
3 166.238678 -14.774992 586.454590 0 0 1 1 1 15.826117
4 241.036865 -15.400753 815.009399 0 0 1 2 1 16.475443
5 281.086792 -15.534301 948.707581 0 0 1 3 1 16.503710
6 288.768768 -15.577776 968.900024 0 0 2 3 1 16.596043
7 309.145935 -15.533208 1038.588745 0 0 1 4 1 16.576143
8 312.951233 -15.579374 1059.395020 0 0 2 4 1 16.457436
9 312.313416 -16.685022 1059.395020 -1 -1 2 4 1 16.425705
Now my goal is to find a way to change the entry-index value 5 to 4. I want to do this in a way that automates the process, so that with a huge number of entries (~20,000) my filter can delete the unusable entries and then renumber the remaining ones sequentially from 0 to the last unfiltered entry. I've tried all sorts of commands but have had no luck. Is there a way to do this directly?
df_muon = (df_muon
           .reset_index()                        # Get the multi-index back as columns
           .replace({'entry': 5}, {'entry': 4})  # Replace 5 in column 'entry' with 4
           .set_index(['entry', 'subentry'])     # Go back to the multi-index
           )
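The literal 5-to-4 replacement does not scale to ~20,000 entries, though. A sketch of a general renumbering (my suggestion; pd.factorize maps the surviving entry values to 0, 1, 2, ... in order of appearance):
import pandas as pd

entry = df_muon.index.get_level_values('entry')
subentry = df_muon.index.get_level_values('subentry')
new_entry = pd.factorize(entry)[0]  # 0, 1, 2, ... per distinct entry, in order
df_muon.index = pd.MultiIndex.from_arrays([new_entry, subentry],
                                          names=['entry', 'subentry'])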

Drop columns with more than 70% zeros

I would like to know if there is a command that drops columns that have more than 70% zeros (or X% zeros), like this one:
df = df.loc[:, df.isnull().mean() < .7]
which does the job for NaN.
Thank you!
Just change df.isnull().mean() to (df==0).mean():
df = df.loc[:, (df==0).mean() < .7]
Here's a demo:
df
Out:
0 1 2 3 4
0 1 1 1 1 0
1 1 0 0 0 1
2 0 1 1 0 0
3 1 0 0 1 0
4 1 1 1 1 1
5 1 0 0 0 0
6 0 1 0 0 0
7 0 1 1 0 0
8 1 0 0 1 0
9 0 0 0 1 0
(df==0).mean()
Out:
0 0.4
1 0.5
2 0.6
3 0.5
4 0.8
dtype: float64
df.loc[:, (df==0).mean() < .7]
Out:
0 1 2 3
0 1 1 1 1
1 1 0 0 0
2 0 1 1 0
3 1 0 0 1
4 1 1 1 1
5 1 0 0 0
6 0 1 0 0
7 0 1 1 0
8 1 0 0 1
9 0 0 0 1
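To make the X% part explicit, the threshold can be lifted into a small helper (a sketch; the name drop_mostly_zero is my own):
def drop_mostly_zero(df, threshold=0.7):
    # keep columns whose fraction of zeros is below `threshold`
    return df.loc[:, (df == 0).mean() < threshold]

df = drop_mostly_zero(df, threshold=0.7)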

Convert a list of values to a time series in python

I want to convert the following data:
jan_1 jan_15 feb_1 feb_15 mar_1 mar_15 apr_1 apr_15 may_1 may_15 jun_1 jun_15 jul_1 jul_15 aug_1 aug_15 sep_1 sep_15 oct_1 oct_15 nov_1 nov_15 dec_1 dec_15
0 0 0 0 0 1 1 2 2 2 2 2 2 3 3 3 3 3 0 0 0 0 0 0
into an array of length 365, where each value is repeated until the next date, e.g. 0 is repeated from January 1 to January 15...
I could do something like numpy.repeat, but that is not date-aware, so it would not take into account that fewer than 15 days pass between feb_15 and mar_1.
Any pythonic solution for this?
You can use resample:
# add a last value for Dec 31, copied from the last column of df
df['dec_31'] = df.iloc[:, -1]
# convert the column labels to datetime - see http://strftime.org/
df.columns = pd.to_datetime(df.columns, format='%b_%d')
# transpose and resample by day, forward-filling
df1 = df.T.resample('d').ffill()
df1.columns = ['col']
print (df1)
col
1900-01-01 0
1900-01-02 0
1900-01-03 0
1900-01-04 0
1900-01-05 0
1900-01-06 0
1900-01-07 0
1900-01-08 0
1900-01-09 0
1900-01-10 0
1900-01-11 0
1900-01-12 0
1900-01-13 0
1900-01-14 0
1900-01-15 0
1900-01-16 0
1900-01-17 0
1900-01-18 0
1900-01-19 0
1900-01-20 0
1900-01-21 0
1900-01-22 0
1900-01-23 0
1900-01-24 0
1900-01-25 0
1900-01-26 0
1900-01-27 0
1900-01-28 0
1900-01-29 0
1900-01-30 0
..
1900-12-02 0
1900-12-03 0
1900-12-04 0
1900-12-05 0
1900-12-06 0
1900-12-07 0
1900-12-08 0
1900-12-09 0
1900-12-10 0
1900-12-11 0
1900-12-12 0
1900-12-13 0
1900-12-14 0
1900-12-15 0
1900-12-16 0
1900-12-17 0
1900-12-18 0
1900-12-19 0
1900-12-20 0
1900-12-21 0
1900-12-22 0
1900-12-23 0
1900-12-24 0
1900-12-25 0
1900-12-26 0
1900-12-27 0
1900-12-28 0
1900-12-29 0
1900-12-30 0
1900-12-31 0
[365 rows x 1 columns]
# if you need a Series
print (df1.col)
1900-01-01    0
1900-01-02    0
1900-01-03    0
..
1900-12-30    0
1900-12-31    0
Freq: D, Name: col, dtype: int64
#transpose and convert to numpy array
print (df1.T.values)
[[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
3 3 3 3 3 3 3 3 3 3 3 3 3 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]
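A caveat on the year: %b_%d without a year parses every label into 1900, which is not a leap year, hence exactly 365 days. If the series should honor a leap year, one option (my suggestion, untested against the original data) is to prepend an explicit year before parsing, instead of the to_datetime call above:
# parse into a leap year (e.g. 2000) to get a 366-day series
df.columns = pd.to_datetime('2000_' + df.columns, format='%Y_%b_%d')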
IIUC you can do it this way:
In [194]: %paste
# transpose DF, rename columns
x = df.T.reset_index().rename(columns={'index': 'date', 0: 'val'})
# parse dates
x['date'] = pd.to_datetime(x['date'], format='%b_%d')
# group by month and resample('D') within each group
result = (x.groupby(x['date'].dt.month)
           .apply(lambda x: x.set_index('date').resample('1D').ffill()))
# rename the index levels
result.index.names = ['month', 'date']
## -- End pasted text --
In [212]: result
Out[212]:
val
month date
1 1900-01-01 0
1900-01-02 0
1900-01-03 0
1900-01-04 0
1900-01-05 0
1900-01-06 0
1900-01-07 0
1900-01-08 0
1900-01-09 0
1900-01-10 0
1900-01-11 0
1900-01-12 0
1900-01-13 0
1900-01-14 0
1900-01-15 0
2 1900-02-01 0
1900-02-02 0
1900-02-03 0
1900-02-04 0
1900-02-05 0
1900-02-06 0
1900-02-07 0
1900-02-08 0
1900-02-09 0
1900-02-10 0
1900-02-11 0
1900-02-12 0
1900-02-13 0
1900-02-14 0
1900-02-15 0
... ...
11 1900-11-01 0
1900-11-02 0
1900-11-03 0
1900-11-04 0
1900-11-05 0
1900-11-06 0
1900-11-07 0
1900-11-08 0
1900-11-09 0
1900-11-10 0
1900-11-11 0
1900-11-12 0
1900-11-13 0
1900-11-14 0
1900-11-15 0
12 1900-12-01 0
1900-12-02 0
1900-12-03 0
1900-12-04 0
1900-12-05 0
1900-12-06 0
1900-12-07 0
1900-12-08 0
1900-12-09 0
1900-12-10 0
1900-12-11 0
1900-12-12 0
1900-12-13 0
1900-12-14 0
1900-12-15 0
[180 rows x 1 columns]
or using reset_index():
In [213]: result.reset_index().head(20)
Out[213]:
month date val
0 1 1900-01-01 0
1 1 1900-01-02 0
2 1 1900-01-03 0
3 1 1900-01-04 0
4 1 1900-01-05 0
5 1 1900-01-06 0
6 1 1900-01-07 0
7 1 1900-01-08 0
8 1 1900-01-09 0
9 1 1900-01-10 0
10 1 1900-01-11 0
11 1 1900-01-12 0
12 1 1900-01-13 0
13 1 1900-01-14 0
14 1 1900-01-15 0
15 2 1900-02-01 0
16 2 1900-02-02 0
17 2 1900-02-03 0
18 2 1900-02-04 0
19 2 1900-02-05 0
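Note (my reading of the two outputs): this grouped variant only fills the days between the two observed dates within each month, hence 180 rows rather than a full 365-day series. Resampling without the month grouping runs continuously from Jan 1 to the last observed date instead:
full = x.set_index('date').resample('D').ffill()
print (len(full))  # 349 days (Jan 1 - Dec 15); pad to Dec 31 as in the first answer for all 365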
