I want to append new files to a historical table (both are in CSV format and neither is in a database). Before appending, I need to compare the new file against the historical table on two columns in particular: state and the date column (yyyy_mm). First, I pick the rows with the max yyyy_mm per state in the new file, then check whether those entries already exist in the historical table; if they are not in the historical table, append them, otherwise do nothing.
So far I am able to pick the rows with the max yyyy_mm per state, but when I try to compare those picked rows with the historical table, I am not getting the expected output. I tried pandas.merge and pandas.concat, but the output does not match my expected output. Can anyone point out how to do this in pandas? Any thoughts?
Input data:
>>> src_df.to_dict()
{'yyyy_mm': {0: 202001,
1: 202002,
2: 202003,
3: 202002,
4: 202107,
5: 202108,
6: 202109},
'state': {0: 'CA', 1: 'NJ', 2: 'NY', 3: 'NY', 4: 'PA', 5: 'PA', 6: 'PA'},
'col1': {0: 3, 1: 3, 2: 3, 3: 3, 4: 3, 5: 3, 6: 3},
'col2': {0: 3, 1: 3, 2: 3, 3: 3, 4: 3, 5: 3, 6: 4},
'col3': {0: 7, 1: 7, 2: 7, 3: 7, 4: 7, 5: 7, 6: 7}}
>>> hist_df.to_dict()
{'yyyy_mm': {0: 202101,
1: 202002,
2: 202001,
3: 201901,
4: 201907,
5: 201908,
6: 201901,
7: 201907,
8: 201908},
'state': {0: 'CA',
1: 'NJ',
2: 'NY',
3: 'NY',
4: 'NY',
5: 'NY',
6: 'PA',
7: 'PA',
8: 'PA'},
'col1': {0: 1, 1: 3, 2: 4, 3: 4, 4: 4, 5: 4, 6: 4, 7: 4, 8: 4},
'col2': {0: 1, 1: 3, 2: 5, 3: 5, 4: 5, 5: 5, 6: 5, 7: 5, 8: 5},
'col3': {0: 1, 1: 7, 2: 8, 3: 8, 4: 8, 5: 8, 6: 8, 7: 8, 8: 8}}
My current attempt:
picked_rows = src_df.loc[src_df.groupby('state')['yyyy_mm'].idxmax()]
>>> picked_rows.to_dict()
{'yyyy_mm': {0: 202001, 1: 202002, 2: 202003, 6: 202109},
'state': {0: 'CA', 1: 'NJ', 2: 'NY', 6: 'PA'},
'col1': {0: 3, 1: 3, 2: 3, 6: 3},
'col2': {0: 3, 1: 3, 2: 3, 6: 4},
'col3': {0: 7, 1: 7, 2: 7, 6: 7}}
Then I tried the following, but the output is not the same as my expected output:
output_df = pd.concat(picked_rows, hist_df, keys=['state', 'yyyy_mm'], axis=1) # first attempt
output_df = pd.merge(picked_rows, hist_df, how='outer') # second attempt
Neither of those attempts gives me my expected output. How can I get my desired output by comparing the two DataFrames, so that rows from picked_rows are appended to hist_df conditionally, based on the max yyyy_mm per state? How should this be done in pandas?
Objective
I want to check picked_rows against hist_df on the state and yyyy_mm columns, so that only entries from picked_rows with a more recent yyyy_mm for their state are added. I created the desired output below. I tried an inner join and pandas.concat, but neither gives the correct result. Does anyone have any ideas on this?
Here is my desired output that I want to get:
yyyy_mm state col1 col2 col3
0 202101 CA 1 1 1
1 202002 NJ 3 3 7
2 202001 NY 4 5 8
3 201901 NY 4 5 8
4 201907 NY 4 5 8
5 201908 NY 4 5 8
6 201901 PA 4 5 8
7 201907 PA 4 5 8
8 201908 PA 4 5 8
9 202003 NY 3 3 7
10 202109 PA 3 4 7
You should change your picked_rows DataFrame to only include dates that are greater than the hist_df dates:
#keep only rows that are newer than in hist_df
new_data = src_df[src_df["yyyy_mm"].gt(src_df["state"].map(hist_df.groupby("state")["yyyy_mm"].max()))]
#of the new rows, keep the latest updated values
picked_rows = new_data.loc[new_data.groupby("state")["yyyy_mm"].idxmax()]
#concat to hist_df
output_df = pd.concat([hist_df, picked_rows], ignore_index=True)
>>> output_df
yyyy_mm state col1 col2 col3
0 202101 CA 1 1 1
1 202002 NJ 3 3 7
2 202001 NY 4 5 8
3 201901 NY 4 5 8
4 201907 NY 4 5 8
5 201908 NY 4 5 8
6 201901 PA 4 5 8
7 201907 PA 4 5 8
8 201908 PA 4 5 8
9 202003 NY 3 3 7
10 202109 PA 3 4 7
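To make the filtering step more transparent, here is the intermediate result the .map call builds from the sample data (a minimal sketch reusing src_df and hist_df from above; the variable name latest_in_hist is my own):
# latest month already present in hist_df, per state
latest_in_hist = hist_df.groupby("state")["yyyy_mm"].max()
# state
# CA    202101
# NJ    202002
# NY    202001
# PA    201908

# align each src_df row with its state's latest historical month,
# then keep only rows that are strictly newer
new_data = src_df[src_df["yyyy_mm"].gt(src_df["state"].map(latest_in_hist))]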
I am currently learning pandas and would like to know how I can filter the rows whose types column (which contains a dictionary) has more than 3 keys in it. For example,
data = {'id':[1,2,3], 'types': [{1: 'a', 2:'b', 3:'c'},{1: 'a', 2:'b', 3:'c', 4:'d'}, {1: 'a', 2:'b', 3:'c'}]}
df = pd.DataFrame(data)
How can I get the rows where the length of the dictionary in the types column is > 3?
I tried doing
df[len(df['types']) > 3]
but it doesn't work. Is there any simple solution out there?
Use Series.apply or Series.map:
df = df[df['types'].apply(len) > 3]
#alternative
#df = df[df['types'].map(len) > 3]
print (df)
id types
1 2 {1: 'a', 2: 'b', 3: 'c', 4: 'd'}
Or Series.str.len:
df = df[df['types'].str.len() > 3]
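For context on why the original attempt does not work: len(df['types']) returns the number of rows in the Series (a single integer), not the per-row dictionary lengths, so the comparison produces a single boolean instead of a mask. A minimal sketch with the sample data:
print(len(df['types']))        # 3 -> number of rows, not per-row lengths
print(df['types'].apply(len))  # 3, 4, 3 -> length of each dictionary
print(df['types'].str.len())   # same per-row lengths via the .str accessor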
I'm new to Stack Overflow and I have this data set:
df=pd.DataFrame({'ID': {0: 4, 1: 4, 2: 4, 3: 88, 4: 88, 5: 323, 6: 323},
'Step': {0: 'A', 1: 'Bar', 2: 'F', 3: 'F', 4: 'Bar', 5: 'F', 6: 'A'},
'Num': {0: 38, 1: 38, 2: 38, 3: 320, 4: 320, 5: 433, 6: 432},
'Date': {0: '2018-08-02',
1: '2018-12-02',
2: '2019-03-02',
3: '2017-03-02',
4: '2018-03-02',
5: '2020-03-04',
6: '2020-02-03'},
'Occurence': {0: 3, 1: 3, 2: 3, 3: 2, 4: 2, 5: 2, 6: 2}})
The variables 'ID' and 'Step' form a MultiIndex.
I would like to do two things:
FIRST:
If 'Num' is different for the same 'ID', then delete the rows of this ID.
SECONDLY:
For a given ID, the step 'F' should be the last one (the one with the most recent date). If not, then delete the rows of this ID.
I have some difficulties because df['Step'] and df['ID'] are not working ('ID' and 'Step' are in the MultiIndex because of a recent groupby()).
I've tried groupby(level=0), which I found in "Multi index dataframe delete row with maximum value per group",
but I still have some difficulties.
Could someone please help me?
Expected Output :
df=pd.DataFrame({'ID': {0: 4, 1: 4, 2: 4},
                 'Step': {0: 'A', 1: 'Bar', 2: 'F'},
                 'Num': {0: 38, 1: 38, 2: 38},
                 'Date': {0: '2018-08-02',
                          1: '2018-12-02',
                          2: '2019-03-02'},
                 'Occurence': {0: 3, 1: 3, 2: 3}})
ID 88 has been removed because the step 'F' was not the last one (the one with the most recent date). ID 323 has been removed because Num 433 != Num 432.
Since you stated that ID and Step are in the index, we can do it this way:
df1[df1.sort_values('Date').groupby('ID')['Num']
       .transform(lambda x: (x.nunique() == 1) &
                            (x.index.get_level_values(1)[-1] == 'F'))]
Output:
Num Date Occurence
ID Step
4 A 38 2018-08-02 3
Bar 38 2018-12-02 3
F 38 2019-03-02 3
How?
First, sort the dataframe by 'Date'.
Then group the dataframe by ID.
Within each group, transform the 'Num' column into a boolean Series: take the number of unique elements of 'Num' in that group; if it equals 1, all 'Num's in the group are the same and the result is True.
Secondly, take the inner level of the MultiIndex (level=1) and check its last value by indexing with [-1]; if that value is 'F', the result is also True.
Group the dataframe by column ID
Transform the Num column using nunique to count the unique values
Transform the Step column using last to check whether the last value per group is F
Combine the boolean masks with a logical AND and filter the rows
g = df.groupby('ID')
m = g['Num'].transform('nunique').eq(1) & g['Step'].transform('last').eq('F')
print(df[m])
ID Step Num Date Occurence
0 4 A 38 2018-08-02 3
1 4 Bar 38 2018-12-02 3
2 4 F 38 2019-03-02 3
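One caveat (an assumption on my part, since the sample rows happen to be in date order already): transform('last') picks the last row in the frame's current order, so if the rows within an ID are not already sorted by Date, sort first so that "last" really means the most recent entry:
df_sorted = df.sort_values('Date')           # ensure the last row per ID is the most recent
g = df_sorted.groupby('ID')
m = g['Num'].transform('nunique').eq(1) & g['Step'].transform('last').eq('F')
print(df_sorted[m])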
Alternative approach with groupby and filter, though it could be less efficient than the approach above:
df.groupby('ID').filter(lambda g: g['Step'].iloc[-1] == 'F' and g['Num'].nunique() == 1)
ID Step Num Date Occurence
0 4 A 38 2018-08-02 3
1 4 Bar 38 2018-12-02 3
2 4 F 38 2019-03-02 3
Note: in case ID and Step are in a MultiIndex, you have to reset the index before using the proposed solutions above.
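For example, a minimal sketch of that reset (assuming ID and Step are currently index levels):
# move the ID and Step index levels back into regular columns,
# then apply either of the masks shown above
df = df.reset_index()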
I don't know if I understood correctly, but you can try this:
import pandas as pd

sheet = pd.read_excel(io="you_file", sheet_name='sheet_name', na_filter=False, header=0)
list_objects = []
for index, row in sheet.iterrows():
    if row['ID'] != index:
        list_objects.append(row)
list_objects will be a list of rows (pandas Series) that you can work with.
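If a DataFrame is needed afterwards, the collected rows can be turned back into one (a small follow-up sketch, assuming list_objects was filled as above):
filtered_df = pd.DataFrame(list_objects)
print(filtered_df)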
Use groupby to find the IDs whose 'Num' occurs only once (i.e. 'Num' differs within the ID), then drop the rows of those IDs from the dataframe. IDs with only a single row are excluded first so they are not considered for deletion.
df=pd.DataFrame({'ID': {0: 4, 1: 4, 2: 4, 3: 88, 4: 88, 5: 323, 6: 323},
'Step': {0: 'A', 1: 'Bar', 2: 'F', 3: 'F', 4: 'Bar', 5: 'F', 6: 'A'},
'Num': {0: 38, 1: 38, 2: 38, 3: 320, 4: 320, 5: 433, 6: 432},
'Date': {0: '2018-08-02',
1: '2018-12-02',
2: '2019-03-02',
3: '2017-03-02',
4: '2018-03-02',
5: '2020-03-04',
6: '2020-02-03'},
'Occurence': {0: 3, 1: 3, 2: 3, 3: 2, 4: 2, 5: 2, 6: 2}})
df.set_index(['ID', 'Step'], inplace=True)
print(df)
print("If 'Num' is different for the same 'ID', then delete the rows of this ID.")
# exclude IDs with a single occurrence so they are not considered for deletion
grouped = df.groupby(df.index.get_level_values(0)).size().eq(1)
single_ids = set(grouped[grouped].index)
keep_ids = [x for x in df.index.get_level_values(0) if x not in single_ids]
# (ID, Num) pairs that occur only once, i.e. IDs whose 'Num' is not constant
subset = df[df.index.get_level_values(0).isin(keep_ids)]
grouped = subset.groupby([subset.index.get_level_values(0), 'Num']).size().eq(1)
labels = set(x for x, y in grouped[grouped].index)
if len(labels) > 0:
    df = df.drop(labels=list(labels), axis=0, level=0)
print(df)
output:
Num Date Occurence
ID Step
4 A 38 2018-08-02 3
Bar 38 2018-12-02 3
F 38 2019-03-02 3
88 F 320 2017-03-02 2
Bar 320 2018-03-02 2
I am trying the following:
import pandas as pd
df = pd.DataFrame({'Col1': {0: 'A', 1: 'A', 2: 'B', 3: 'B', 4: 'B'},
'Col2': {0: 'a', 1: 'a', 2: 'b', 3: 'b', 4: 'c'},
'Col3': {0: 42, 1: 28, 2: 56, 3: 62, 4: 48}})
ii = 1
for idx, row in df.iterrows():
    print(row)
    df.at[:, 'Col2'] = 'asd{}'.format(ii)
    ii += 1
But the print statement above doesn't reflect the change made by df.at[:, 'Col2'] = 'asd{}'.format(ii). I need the print statements to reflect that change.
Edit: Since I am updating all rows of df, I was expecting idx and row to pick up the new values from the dataframe.
If this is not the right way to get updated values from df through idx and row, what is the correct approach? I need idx and row to reflect the new values.
Expected output:
Col1 A
Col2 a
Col3 42
Name: 0, dtype: object
Col1 A
Col2 asd1
Col3 28
Name: 1, dtype: object
Col1 B
Col2 asd2
Col3 56
.....
From iterrows documentation:
You should never modify something you are iterating over. This is not
guaranteed to work in all cases. Depending on the data types, the
iterator returns a copy and not a view, and writing to it will have no
effect.
As per your request for an alternative solution, here is one using DataFrame.apply:
df['Col2'] = df.apply(lambda row: 'asd{}'.format(row.name), axis=1)
Other examples (also using Series.apply) that may be useful for your eventual goal (which is not entirely clear yet):
df['Col2'] = df['Col2'].apply(lambda x: 'asd{}'.format(x))
df['Col2'] = df.apply(lambda row: 'asd{}'.format(row['Col3']), axis=1)
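If you truly need to iterate and have each iteration see the values written by earlier ones, one option (a sketch of an alternative pattern, not part of the apply examples above) is to loop over the index and read rows from the DataFrame itself instead of relying on the copies yielded by iterrows:
ii = 1
for idx in df.index:
    print(df.loc[idx])                      # reads the current, possibly updated row
    df.loc[:, 'Col2'] = 'asd{}'.format(ii)  # write is visible on the next iteration
    ii += 1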
Here is something you can try,
import pandas as pd
df = pd.DataFrame({'Col1': {0: 'A', 1: 'A', 2: 'B', 3: 'B', 4: 'B'},
'Col2': {0: 'a', 1: 'a', 2: 'b', 3: 'b', 4: 'c'},
'Col3': {0: 42, 1: 28, 2: 56, 3: 62, 4: 48}})
print(
    df.assign(idx=df.index)[['idx', 'Col2']]
      .apply(lambda x: x['Col2'] if x['idx'] == 0 else f"asd{x['idx']}", axis=1)
)
0 a
1 asd1
2 asd2
3 asd3
4 asd4
dtype: object
I have 2 DataFrames containing examples, and I would like to see if an example of DataFrame 1 is present in DataFrame 2.
Normally I would aggregate the rows per example and simply merge the DataFrames. Unfortunately, the merging has to be done via a "matching table", which has a many-to-many relationship between the keys (id_low vs. id_high).
Simplified example (for Python)
import pandas as pd
# Dataframe 1 - containing 1 Example
d1 = pd.DataFrame.from_dict({'Example': {0: 'Example 1', 1: 'Example 1', 2: 'Example 1'},
'id_low': {0: 1, 1: 2, 2: 3}})
# DataFrame 2 - containing 1 Example
d2 = pd.DataFrame.from_dict({'Example': {0: 'Example 2', 1: 'Example 2', 2: 'Example 2'},
'id_low': {0: 1, 1: 4, 2: 6}})
# DataFrame 3 - matching table
dm = pd.DataFrame.from_dict({'id_low': {0: 1, 1: 2, 2: 2, 3: 3, 4: 3, 5: 4, 6: 5, 7: 6, 8: 6},
'id_high': {0: 'A',
1: 'B',
2: 'C',
3: 'D',
4: 'E',
5: 'B',
6: 'B',
7: 'E',
8: 'F'}})
d1 and d2 are matchable through the matching table dm, as described above.
Expected Output (or similar):
df_output = pd.DataFrame.from_dict({'Example': {0: 'Example 1'}, 'Example_2': {0: 'Example 2'}})
Failed attempts
Aggregating with values translated via the matching table, then merging. I also considered using regex with the OR operator.
IIUC:
d2.merge(dm)\
    .merge(d1.merge(dm), on='id_high')\
    .groupby(['Example_x', 'Example_y'])['id_high'].agg(list)\
    .reset_index()
Output:
Example_x Example_y id_high
0 Example 2 Example 1 [A, B, E]
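To unpack what the chained merge does with the sample data, here is the same logic broken into steps (a sketch with intermediate variable names of my own, not from the original answer):
# translate id_low to id_high for each example via the matching table
d1_high = d1.merge(dm)     # Example 1 rows with id_high A, B, C, D, E
d2_high = d2.merge(dm)     # Example 2 rows with id_high A, B, E, F

# inner join on id_high keeps only the ids both examples share
shared = d2_high.merge(d1_high, on='id_high')

# collect the shared id_high values per pair of examples
out = (shared.groupby(['Example_x', 'Example_y'])['id_high']
             .agg(list)
             .reset_index())
print(out)                 # Example 2 | Example 1 | [A, B, E]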