How to label identical pandas dataframe rows? - python

I have a large pandas dataframe like this:
log apple watermelon orange lemon grapes
1 1 1 yes 0 0
1 2 0 1 0 0
1 True 0 0 0 2
2 0 0 0 0 2
2 1 1 yes 0 0
2 0 0 0 0 2
2 0 0 0 0 2
3 True 0 0 0 2
4 0 0 0 0 2.1
4 0 0 0 0 2.1
How can I label the rows that are the same, for example:
log apple watermelon orange lemon grapes ID
1 1 1 yes 0 0 1
1 2 0 1 0 0 2
1 True 0 0 0 2 3
2 0 0 0 0 2 4
2 1 1 yes 0 0 1
2 0 0 0 0 2 4
2 0 0 0 0 2 4
3 True 0 0 0 2 3
4 0 0 0 0 2.1 5
4 0 0 0 0 2.1 5
I tried to:
df['ID']=df.groupby('log')[df.columns].transform('ID')
And
df['personid'] = df['log'].clip_upper(2) - 2*d.duplicated(subset='apple')
df
However, the above doesn't work because I have a lot of columns, and it's not giving me the expected output. Any idea of how to group and label this dataframe?

Given
import io
import pandas as pd

x = io.StringIO("""log apple watermelon orange lemon grapes
1 1 1 yes 0 0
1 2 0 1 0 0
1 True 0 0 0 2
2 0 0 0 0 2
2 1 1 yes 0 0
2 0 0 0 0 2
2 0 0 0 0 2
3 True 0 0 0 2
4 0 0 0 0 2.1
4 0 0 0 0 2.1""")
df2 = pd.read_table(x, delim_whitespace=True)
You can first use transform with tuple to make each row hashable and comparable, and then play with indexes and range to create unique IDs:
f = df2.transform(tuple, axis=1).to_frame()  # one tuple per row, in column 0
k = f.groupby(0).sum()                       # index of unique row tuples
k['id'] = range(1, len(k.index) + 1)         # sequential id per unique tuple
And finally:
df2['temp_key'] = f[0]
df2 = df2.set_index('temp_key')
df2['id'] = k.id  # aligns on the tuple index
df2.reset_index().drop('temp_key', axis=1)
log apple watermelon orange lemon grapes id
0 1 1 1 yes 0 0.0 1
1 1 2 0 1 0 0.0 2
2 1 True 0 0 0 2.0 3
3 2 0 0 0 0 2.0 4
4 2 1 1 yes 0 0.0 5
5 2 0 0 0 0 2.0 4
6 2 0 0 0 0 2.0 4
7 3 True 0 0 0 2.0 6
8 4 0 0 0 0 2.1 7
9 4 0 0 0 0 2.1 7
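A more direct route (a sketch of my own, not part of the original answer) is GroupBy.ngroup, which assigns a sequential group number per unique key. Applied to the freshly parsed df2, grouping by every column gives one id per distinct row; and if, as the question's expected output suggests, rows should match regardless of the log value, group by the remaining columns instead:
# number each distinct row, in order of first appearance
df2['id'] = df2.groupby(list(df2.columns), sort=False).ngroup() + 1
# ignore 'log' (and the freshly added 'id') when comparing rows
key_cols = [c for c in df2.columns if c not in ('log', 'id')]
df2['ID'] = df2.groupby(key_cols, sort=False).ngroup() + 1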

Related

Is there any way to convert the columns in a Pandas DataFrame using its mirror image DataFrame structure

The df I have is:
0 1 2
0 0 0 0
1 0 0 1
2 0 1 0
3 0 1 1
4 1 0 0
5 1 0 1
6 1 1 0
7 1 1 1
I want to obtain a DataFrame with the columns reversed (a mirror image):
0 1 2
0 0 0 0
1 1 0 0
2 0 1 0
3 1 1 0
4 0 0 1
5 1 0 1
6 0 1 1
7 1 1 1
Is there any way to do that?
You can check:
df[:] = df.iloc[:,::-1]
df
Out[959]:
0 1 2
0 0 0 0
1 1 0 0
2 0 1 0
3 1 1 0
4 0 0 1
5 1 0 1
6 0 1 1
7 1 1 1
Here is a slightly more verbose, but likely more efficient, solution, as it doesn't rewrite the data; it only renames and reorders the columns:
cols = df.columns
df.columns = df.columns[::-1]
df = df.loc[:,cols]
Or a shorter variant:
df = df.iloc[:,::-1].set_axis(df.columns, axis=1)
Output:
0 1 2
0 0 0 0
1 1 0 0
2 0 1 0
3 1 1 0
4 0 0 1
5 1 0 1
6 0 1 1
7 1 1 1
There are other ways, but here's one solution:
df[df.columns] = df[reversed(df.columns)]
Output:
0 1 2
0 0 0 0
1 1 0 0
2 0 1 0
3 1 1 0
4 0 0 1
5 1 0 1
6 0 1 1
7 1 1 1
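For reference, a minimal self-contained sketch (my reconstruction of the demo frame via itertools.product, not from the original posts) that checks the reversal:
import itertools
import pandas as pd

rows = list(itertools.product([0, 1], repeat=3))  # 000, 001, ..., 111
df = pd.DataFrame(rows)
# reversed values relabelled with the original column names
mirrored = df.iloc[:, ::-1].set_axis(df.columns, axis=1)
assert mirrored.equals(pd.DataFrame([r[::-1] for r in rows]))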

Count how many cells are between the last value in the dataframe and the end of the row

I'm using the pandas library in Python.
I have a data frame:
0 1 2 3 4
0 0 0 0 1 0
1 0 0 0 0 1
2 0 0 1 0 0
3 1 0 0 0 0
4 0 0 1 0 0
5 0 1 0 0 0
6 1 0 0 1 1
Is it possible to create a new column that is a count of the number of cells that are empty between the end of the row and the last value above zero? Example data frame below:
0 1 2 3 4 Value
0 0 0 0 1 0 1
1 0 0 0 0 1 0
2 0 0 1 0 0 2
3 1 0 0 0 0 4
4 0 0 1 0 0 2
5 0 1 0 0 0 3
6 1 0 0 1 1 0
Using argmax:
df['Value'] = df.apply(lambda x: (x.iloc[::-1] == 1).argmax(), axis=1)
Or using np.where:
df['Value'] = np.where(df.iloc[:, ::-1] == 1, True, False).argmax(1)
0 1 2 3 4 Value
0 0 0 0 1 0 1
1 0 0 0 0 1 0
2 0 0 1 0 0 2
3 1 0 0 0 0 4
4 0 0 1 0 0 2
5 0 1 0 0 0 3
6 1 0 0 1 1 0
Use:
df['new'] = df.iloc[:, ::-1].cumsum(axis=1).eq(0).sum(axis=1)
print (df)
0 1 2 3 4 new
0 0 0 0 1 0 1
1 0 0 0 0 1 0
2 0 0 1 0 0 2
3 1 0 0 0 0 4
4 0 0 1 0 0 2
5 0 1 0 0 0 3
6 1 0 0 1 1 0
Details:
First reverse the order of the columns with DataFrame.iloc and slicing:
print (df.iloc[:, ::-1])
4 3 2 1 0
0 0 1 0 0 0
1 1 0 0 0 0
2 0 0 1 0 0
3 0 0 0 0 1
4 0 0 1 0 0
5 0 0 0 1 0
6 1 1 0 0 1
Then take the cumulative sum per row with DataFrame.cumsum:
print (df.iloc[:, ::-1].cumsum(axis=1))
4 3 2 1 0
0 0 1 1 1 1
1 1 1 1 1 1
2 0 0 1 1 1
3 0 0 0 0 1
4 0 0 1 1 1
5 0 0 0 1 1
6 1 2 2 2 3
Then flag the positions that are still 0 with DataFrame.eq:
print (df.iloc[:, ::-1].cumsum(axis=1).eq(0))
4 3 2 1 0
0 True False False False False
1 False False False False False
2 True True False False False
3 True True True True False
4 True True False False False
5 True True True False False
6 False False False False False
And finally count them per row with sum:
print (df.iloc[:, ::-1].cumsum(axis=1).eq(0).sum(axis=1))
0 1
1 0
2 2
3 4
4 2
5 3
6 0
dtype: int64
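One edge case worth noting (my observation, not from the answers above): the two families of solutions disagree on a row that contains no 1 at all. argmax over an all-False mask returns 0, while the cumsum approach counts every cell:
row = pd.DataFrame([[0, 0, 0, 0, 0]])
print (row.apply(lambda x: (x.iloc[::-1] == 1).argmax(), axis=1))  # 0    0
print (row.iloc[:, ::-1].cumsum(axis=1).eq(0).sum(axis=1))         # 0    5
Pick whichever convention matches how empty rows should be counted.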

Reindex Pandas MultiIndex

I am trying to create a new index for a dataframe created from a ROOT file. I'm using uproot to bring in the file with the command:
upfile_muon = uproot.open(file_prefix_muon + '.root')
tree_muon = upfile_muon['ntupler']['tree']
df_muon = tree_muon.pandas.df(['vh_sim_r', 'vh_sim_phi', 'vh_sim_z', 'vh_sim_tp1', 'vh_sim_tp2',
                               'vh_type', 'vh_station', 'vh_ring', 'vh_sim_theta'],
                              entrystop=args.max_events)
This creates a MultiIndex pandas dataframe with entry and subentry as its two index levels. I want to filter out all entries with 3 or fewer subentries. I do that with the following loop, while also building vectors that slice the dataframe down to the data I need.
a = 0
bad_entries = 0
entries = []
nuindex = []
tru = 0
while a < args.max_events:
    if df_muon.loc[(a), :].shape[0] > 3:
        entries.append(a)
        b = 0
        while b < df_muon.loc[(a), :].shape[0]:
            nuindex.append(tru)
            b = b + 1
        tru = tru + 1
    else:
        bad_entries = bad_entries + 1
    a = a + 1
df_muon = df_muon.loc[pd.IndexSlice[entries, :], :]
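For reference, the same filter can likely be expressed without the explicit loop, using groupby.filter on the first index level (a sketch, assuming the level is named 'entry' as in the display below):
# keep only entries with more than 3 subentries
df_muon = df_muon.groupby(level='entry').filter(lambda g: len(g) > 3)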
So now my dataframe looks like this
vh_sim_r vh_sim_phi vh_sim_z vh_sim_tp1 vh_sim_tp2 vh_type vh_station vh_ring vh_sim_theta
entry subentry
0 0 149.724701 -124.728081 793.598755 0 0 3 2 1 10.684152
1 149.236725 -124.180763 796.001221 -1 -1 3 2 1 10.618716
2 149.456131 -124.687302 796.001221 0 0 3 2 1 10.633972
3 92.405533 -126.913628 539.349976 0 0 4 1 1 9.721958
4 149.345184 -124.332527 839.810669 0 0 1 2 1 10.083608
5 176.544983 -123.978333 964.500000 0 0 2 3 1 10.372764
6 194.614502 -123.764595 1054.994995 0 0 2 4 1 10.451831
7 149.236725 -124.180763 796.001221 -1 -1 3 2 1 10.618716
8 149.456131 -124.687302 796.001221 0 0 3 2 1 10.633972
9 92.405533 -126.913628 539.349976 0 0 4 1 1 9.721958
10 149.345184 -124.332527 839.810669 0 0 1 2 1 10.083608
11 176.544983 -123.978333 964.500000 0 0 2 3 1 10.372764
12 194.614502 -123.764595 1054.994995 0 0 2 4 1 10.451831
1 0 265.027252 -3.324370 796.001221 0 0 3 2 1 18.415092
1 272.908997 -3.531896 839.903625 0 0 1 2 1 18.000479
2 299.305176 -3.531351 923.885132 0 0 1 3 1 17.950438
3 312.799255 -3.499015 964.500000 0 0 2 3 1 17.968519
4 328.321442 -3.530087 1013.620056 0 0 1 4 1 17.947645
5 181.831726 -1.668625 567.971252 0 0 3 1 1 17.752077
6 265.027252 -3.324370 796.001221 0 0 3 2 1 18.415092
7 197.739120 -2.073746 615.796265 0 0 1 1 1 17.802410
8 272.908997 -3.531896 839.903625 0 0 1 2 1 18.000479
9 299.305176 -3.531351 923.885132 0 0 1 3 1 17.950438
10 312.799255 -3.499015 964.500000 0 0 2 3 1 17.968519
11 328.321442 -3.530087 1013.620056 0 0 1 4 1 17.947645
12 356.493073 -3.441958 1065.694946 0 0 2 4 2 18.495964
2 0 204.523163 -124.065643 839.835571 0 0 1 2 1 13.686690
1 135.439163 -122.568153 567.971252 0 0 3 1 1 13.412345
2 196.380875 -123.940300 796.001221 0 0 3 2 1 13.858652
3 129.801193 -122.348656 539.349976 0 0 4 1 1 13.531607
4 224.134796 -124.194283 923.877441 0 0 1 3 1 13.636631
5 237.166031 -124.181770 964.500000 0 0 2 3 1 13.814683
6 246.809235 -124.196938 1013.871643 0 0 1 4 1 13.681540
7 259.389587 -124.164017 1054.994995 0 0 2 4 1 13.813211
8 204.523163 -124.065643 839.835571 0 0 1 2 1 13.686690
9 196.380875 -123.940300 796.001221 0 0 3 2 1 13.858652
10 129.801193 -122.348656 539.349976 0 0 4 1 1 13.531607
11 224.134796 -124.194283 923.877441 0 0 1 3 1 13.636631
12 237.166031 -124.181770 964.500000 0 0 2 3 1 13.814683
13 246.809235 -124.196938 1013.871643 0 0 1 4 1 13.681540
14 259.389587 -124.164017 1054.994995 0 0 2 4 1 13.813211
3 0 120.722900 -22.053474 615.786621 0 0 1 1 4 11.091969
1 170.635376 -23.190208 793.598755 0 0 3 2 1 12.134683
2 110.061127 -21.370941 539.349976 0 0 4 1 1 11.533570
3 164.784668 -23.263920 814.977478 0 0 1 2 1 11.430829
4 192.868652 -23.398684 948.691345 0 0 1 3 1 11.491603
5 199.817978 -23.325649 968.900024 0 0 2 3 1 11.652840
6 211.474625 -23.265354 1038.803833 0 0 1 4 1 11.506759
7 216.406830 -23.275047 1059.395020 0 0 2 4 1 11.545199
8 170.612457 -23.136520 793.598755 -1 -1 3 2 1 12.133101
5 0 179.913177 -14.877813 615.749207 0 0 1 1 1 16.287615
1 160.188034 -14.731569 565.368774 0 0 3 1 1 15.819215
2 240.671204 -15.410946 793.598755 0 0 3 2 1 16.870745
3 166.238678 -14.774992 586.454590 0 0 1 1 1 15.826117
4 241.036865 -15.400753 815.009399 0 0 1 2 1 16.475443
5 281.086792 -15.534301 948.707581 0 0 1 3 1 16.503710
6 288.768768 -15.577776 968.900024 0 0 2 3 1 16.596043
7 309.145935 -15.533208 1038.588745 0 0 1 4 1 16.576143
8 312.951233 -15.579374 1059.395020 0 0 2 4 1 16.457436
9 312.313416 -16.685022 1059.395020 -1 -1 2 4 1 16.425705
Now my goal is to find a way to change the entry-index value 5 to 4. I want to do this in a way that automates the process, so that with a huge number of entries (~20,000) my filter can delete the unusable entries and then renumber the remaining ones sequentially from 0 to the last unfiltered entry. I've tried all sorts of commands but have had no luck. Is there a way to do this directly?
df_muon = (df_muon
           .reset_index()                        # Get the multi-index back as columns
           .replace({'entry': 5}, {'entry': 4})  # Replace 5 in column 'entry' with 4
           .set_index(['entry', 'subentry'])     # Go back to the multi-index
           )
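The literal 5-to-4 replacement does not scale to ~20,000 entries, though. A sketch of a general renumbering (my suggestion; pd.factorize maps the surviving entry values to 0, 1, 2, ... in order of appearance):
import pandas as pd

entry = df_muon.index.get_level_values('entry')
subentry = df_muon.index.get_level_values('subentry')
new_entry = pd.factorize(entry)[0]  # 0, 1, 2, ... per distinct entry, in order
df_muon.index = pd.MultiIndex.from_arrays([new_entry, subentry],
                                          names=['entry', 'subentry'])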

Drop columns with more than 70% zeros

I would like to know if there is a command that drops columns that have more than 70% zeros (or X% zeros), like this one:
df = df.loc[:, df.isnull().mean() < .7]
which does the job for NaN.
Thank you!
Just change df.isnull().mean() to (df==0).mean():
df = df.loc[:, (df==0).mean() < .7]
Here's a demo:
df
Out:
0 1 2 3 4
0 1 1 1 1 0
1 1 0 0 0 1
2 0 1 1 0 0
3 1 0 0 1 0
4 1 1 1 1 1
5 1 0 0 0 0
6 0 1 0 0 0
7 0 1 1 0 0
8 1 0 0 1 0
9 0 0 0 1 0
(df==0).mean()
Out:
0 0.4
1 0.5
2 0.6
3 0.5
4 0.8
dtype: float64
df.loc[:, (df==0).mean() < .7]
Out:
0 1 2 3
0 1 1 1 1
1 1 0 0 0
2 0 1 1 0
3 1 0 0 1
4 1 1 1 1
5 1 0 0 0
6 0 1 0 0
7 0 1 1 0
8 1 0 0 1
9 0 0 0 1
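To make the X% part explicit, the threshold can be lifted into a small helper (a sketch; the name drop_mostly_zero is my own):
def drop_mostly_zero(df, threshold=0.7):
    # keep columns whose fraction of zeros is below `threshold`
    return df.loc[:, (df == 0).mean() < threshold]

df = drop_mostly_zero(df, threshold=0.7)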

Convert a list of values to a time series in python

I want to convert the following data:
jan_1 jan_15 feb_1 feb_15 mar_1 mar_15 apr_1 apr_15 may_1 may_15 jun_1 jun_15 jul_1 jul_15 aug_1 aug_15 sep_1 sep_15 oct_1 oct_15 nov_1 nov_15 dec_1 dec_15
0 0 0 0 0 1 1 2 2 2 2 2 2 3 3 3 3 3 0 0 0 0 0 0
into an array of length 365, where each value is repeated until the next date, e.g. 0 is repeated from January 1 to January 15...
I could do something like numpy.repeat, but that is not date-aware, so it would not take into account that fewer than 15 days pass between feb_15 and mar_1.
Any pythonic solution for this?
You can use resample:
# add a last value for Dec 31, copied from the last column of df
df['dec_31'] = df.iloc[:, -1]
# convert the column labels to datetime - see http://strftime.org/
df.columns = pd.to_datetime(df.columns, format='%b_%d')
# transpose and resample by day, forward-filling
df1 = df.T.resample('d').ffill()
df1.columns = ['col']
print (df1)
col
1900-01-01 0
1900-01-02 0
1900-01-03 0
1900-01-04 0
1900-01-05 0
1900-01-06 0
1900-01-07 0
1900-01-08 0
1900-01-09 0
1900-01-10 0
1900-01-11 0
1900-01-12 0
1900-01-13 0
1900-01-14 0
1900-01-15 0
1900-01-16 0
1900-01-17 0
1900-01-18 0
1900-01-19 0
1900-01-20 0
1900-01-21 0
1900-01-22 0
1900-01-23 0
1900-01-24 0
1900-01-25 0
1900-01-26 0
1900-01-27 0
1900-01-28 0
1900-01-29 0
1900-01-30 0
..
1900-12-02 0
1900-12-03 0
1900-12-04 0
1900-12-05 0
1900-12-06 0
1900-12-07 0
1900-12-08 0
1900-12-09 0
1900-12-10 0
1900-12-11 0
1900-12-12 0
1900-12-13 0
1900-12-14 0
1900-12-15 0
1900-12-16 0
1900-12-17 0
1900-12-18 0
1900-12-19 0
1900-12-20 0
1900-12-21 0
1900-12-22 0
1900-12-23 0
1900-12-24 0
1900-12-25 0
1900-12-26 0
1900-12-27 0
1900-12-28 0
1900-12-29 0
1900-12-30 0
1900-12-31 0
[365 rows x 1 columns]
# if you need a Series
print (df1.col)
1900-01-01    0
1900-01-02    0
1900-01-03    0
..
1900-12-30    0
1900-12-31    0
Freq: D, Name: col, dtype: int64
#transpose and convert to numpy array
print (df1.T.values)
[[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
3 3 3 3 3 3 3 3 3 3 3 3 3 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]
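A caveat on the year: %b_%d without a year parses every label into 1900, which is not a leap year, hence exactly 365 days. If the series should honor a leap year, one option (my suggestion, untested against the original data) is to prepend an explicit year before parsing, instead of the to_datetime call above:
# parse into a leap year (e.g. 2000) to get a 366-day series
df.columns = pd.to_datetime('2000_' + df.columns, format='%Y_%b_%d')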
IIUC you can do it this way:
In [194]: %paste
# transpose DF, rename columns
x = df.T.reset_index().rename(columns={'index': 'date', 0: 'val'})
# parse dates
x['date'] = pd.to_datetime(x['date'], format='%b_%d')
# group by month and resample('D') within each group
result = (x.groupby(x['date'].dt.month)
           .apply(lambda x: x.set_index('date').resample('1D').ffill()))
# rename the index levels
result.index.names = ['month', 'date']
## -- End pasted text --
In [212]: result
Out[212]:
val
month date
1 1900-01-01 0
1900-01-02 0
1900-01-03 0
1900-01-04 0
1900-01-05 0
1900-01-06 0
1900-01-07 0
1900-01-08 0
1900-01-09 0
1900-01-10 0
1900-01-11 0
1900-01-12 0
1900-01-13 0
1900-01-14 0
1900-01-15 0
2 1900-02-01 0
1900-02-02 0
1900-02-03 0
1900-02-04 0
1900-02-05 0
1900-02-06 0
1900-02-07 0
1900-02-08 0
1900-02-09 0
1900-02-10 0
1900-02-11 0
1900-02-12 0
1900-02-13 0
1900-02-14 0
1900-02-15 0
... ...
11 1900-11-01 0
1900-11-02 0
1900-11-03 0
1900-11-04 0
1900-11-05 0
1900-11-06 0
1900-11-07 0
1900-11-08 0
1900-11-09 0
1900-11-10 0
1900-11-11 0
1900-11-12 0
1900-11-13 0
1900-11-14 0
1900-11-15 0
12 1900-12-01 0
1900-12-02 0
1900-12-03 0
1900-12-04 0
1900-12-05 0
1900-12-06 0
1900-12-07 0
1900-12-08 0
1900-12-09 0
1900-12-10 0
1900-12-11 0
1900-12-12 0
1900-12-13 0
1900-12-14 0
1900-12-15 0
[180 rows x 1 columns]
or using reset_index():
In [213]: result.reset_index().head(20)
Out[213]:
month date val
0 1 1900-01-01 0
1 1 1900-01-02 0
2 1 1900-01-03 0
3 1 1900-01-04 0
4 1 1900-01-05 0
5 1 1900-01-06 0
6 1 1900-01-07 0
7 1 1900-01-08 0
8 1 1900-01-09 0
9 1 1900-01-10 0
10 1 1900-01-11 0
11 1 1900-01-12 0
12 1 1900-01-13 0
13 1 1900-01-14 0
14 1 1900-01-15 0
15 2 1900-02-01 0
16 2 1900-02-02 0
17 2 1900-02-03 0
18 2 1900-02-04 0
19 2 1900-02-05 0
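Note (my reading of the two outputs): this grouped variant only fills the days between the two observed dates within each month, hence 180 rows rather than a full 365-day series. Resampling without the month grouping runs continuously from Jan 1 to the last observed date instead:
full = x.set_index('date').resample('D').ffill()
print (len(full))  # 349 days (Jan 1 - Dec 15); pad to Dec 31 as in the first answer for all 365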
