Insert rows into a pandas DataFrame based on a condition - python

I'm using a pandas DataFrame to read a .csv file. I would like to insert a row whenever the value in a specific column changes from one value to another. My data looks like this:
Id type
1 car
1 track
2 train
2 plane
3 car
I need to add a row, with an empty Id and a type value of 4, after every change in the Id column value. My desired output should look like this:
Id type
1 car
1 track
4
2 train
2 plane
4
3 car
How do I do this?

You could use groupby to split by groups and append the rows in a list comprehension before merging again with concat:
df2 = pd.concat([d.append(pd.Series([None, 4], index=['Id', 'type']), ignore_index=True)
                 for _, d in df.groupby('Id')], ignore_index=True).iloc[:-1]
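Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so the snippet above will fail on recent versions. A minimal equivalent sketch using only pd.concat, assuming the same df as in the question (the marker name is just for illustration):
# build one marker row and concatenate it after each Id group
marker = pd.DataFrame({'Id': [None], 'type': [4]})
df2 = pd.concat([pd.concat([d, marker], ignore_index=True)
                 for _, d in df.groupby('Id')], ignore_index=True).iloc[:-1]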
If the index is sorted, another option is to find the index of the last item per group and use it to generate the new rows:
# get index of last item per group (except last)
idx = df.index.to_series().groupby(df['Id']).last().values[:-1]
# craft a DataFrame with the new rows
d = pd.DataFrame([[None, 4]]*len(idx), columns=df.columns, index=idx)
# concatenate and reorder
pd.concat([df, d]).sort_index().reset_index(drop=True)
output:
Id type
0 1.0 car
1 1.0 track
2 NaN 4.0
3 2.0 train
4 2.0 plane
5 NaN 4.0
6 3.0 car

You can do this:
df = pd.read_csv('input.csv', sep=";")
Id type
0 1 car
1 1 track
2 2 train
3 2 plane
4 3 car
mask = df['Id'].ne(df['Id'].shift(-1))
df1 = pd.DataFrame('4', index=mask.index[mask] + .5, columns=df.columns)
df1['Id'] = None  # blank out Id in the inserted rows
df = pd.concat([df, df1]).sort_index().reset_index(drop=True).iloc[:-1]
which gives:
Id type
0 1.0 car
1 1.0 track
2 NaN 4
3 2.0 train
4 2.0 plane
5 NaN 4
6 3.0 car

You can do:
In [244]: grp = df.groupby('Id')
In [256]: res = pd.DataFrame()
In [257]: for x, y in grp:
     ...:     if y['type'].count() > 1:
     ...:         tmp = y.append(pd.DataFrame({'Id': [''], 'type': [4]}))
     ...:         res = res.append(tmp)
     ...:     else:
     ...:         res = res.append(y)
     ...:
In [258]: res
Out[258]:
Id type
0 1 car
1 1 track
0 4
2 2 train
3 2 plane
0 4
4 3 car

Please find the solution below, using the index:
# Create a shift variable to compare against the index
df['idshift'] = df['Id'].shift(1)
# When the shifted Id does not match the Id, the Id changes at that index
change_index = df.index[df['idshift'] != df['Id']].tolist()
change_index
# Loop through all the change indices and insert a row at each
for i in change_index[1:]:
    line = pd.DataFrame({"Id": ' ', "rate": 4}, index=[(i - 1) + .5])
    df = df.append(line, ignore_index=False)
# finally sort the index
df = df.sort_index().reset_index(drop=True)
Input DataFrame:
df = pd.DataFrame({'Id': [1,1,2,2,3,3,3,4],'rate':[1,2,3,10,12,16,10,12]})
Output from the code: the original rows, with a marker row (Id blank, rate 4) inserted between consecutive Id groups.
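Since DataFrame.append was removed in pandas 2.0, here is a hedged sketch of the same insert-at-a-fractional-index idea using pd.concat, assuming the example df defined just above (new_rows is an illustrative name):
df['idshift'] = df['Id'].shift(1)
change_index = df.index[df['idshift'] != df['Id']].tolist()
# build all the new rows at once, at fractional positions between the groups
new_rows = pd.DataFrame({'Id': ' ', 'rate': 4},
                        index=[i - 0.5 for i in change_index[1:]])
df = pd.concat([df.drop(columns='idshift'), new_rows]).sort_index().reset_index(drop=True)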

Related

Compare and remove duplicates from both dataframe

I have 2 dataframes that need to be compared, removing duplicates (if any):
Daily = pd.DataFrame({'col1':[1,2,3], 'col2':[2,3,4]})
Accumulated = pd.DataFrame({'col1':[4,2,5,6], 'col2':[6,3,5,6]})
Out[4]:
col1 col2
0 1 2
1 2 3
2 3 4
col1 col2
0 4 6
1 2 3
2 5 5
3 6 6
What I am trying to achieve is to remove duplicates, if there are any, from both DFs and get the count of the remaining entries in the daily DF.
Expected output:
col1 col2
0 1 2
2 3 4
col1 col2
0 4 6
2 5 5
3 6 6
Count = 2
How can I do it?
Both or either of the DFs can be empty, and daily can have more entries than monthly and vice versa.
Why not just concat both into one df and drop the duplicates completely?
s = (pd.concat([Daily.assign(source="Daily"),
                Accumulated.assign(source="Accumulated")])
       .drop_duplicates(["col1","col2"], keep=False))
print (s[s["source"].eq("Daily")])
col1 col2 source
0 1 2 Daily
2 3 4 Daily
print (s[s["source"].eq("Accumulated")])
col1 col2 source
0 4 6 Accumulated
2 5 5 Accumulated
3 6 6 Accumulated
You can try the code below:
## For the 1st DataFrame
for i in list(df1.index):
    for j in list(df2.index):
        if df1.loc[i].to_list() == df2.loc[j].to_list():
            df1 = df1.drop(index=i)
            break
Similarly, you can do the same for the second dataframe (a sketch follows below).
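For completeness, a sketch of the symmetric loop for the second dataframe. It assumes you kept a copy of the original df1 (a hypothetical df1_orig taken before the first loop), since rows already dropped from df1 would otherwise no longer match:
# df1_orig is assumed to be a copy of df1 taken before the first loop:
# df1_orig = df1.copy()
for j in list(df2.index):
    for i in list(df1_orig.index):
        if df2.loc[j].to_list() == df1_orig.loc[i].to_list():
            df2 = df2.drop(index=j)
            break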
I would do it the following way:
import pandas as pd
daily = pd.DataFrame({'col1':[1,2,3], 'col2':[2,3,4]})
accumulated = pd.DataFrame({'col1':[4,2,5], 'col2':[6,3,5]})
daily['isdaily'] = True
accumulated['isdaily'] = False
together = pd.concat([daily, accumulated])
without_dupes = together.drop_duplicates(['col1','col2'],keep=False)
daily_count = sum(without_dupes['isdaily'])
I added an isdaily column to the dataframes as True/False values so they could easily be summed at the end.
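If you also need the two deduplicated frames themselves, not just the count, a small follow-up sketch (the *_clean names are just for illustration):
daily_clean = without_dupes[without_dupes['isdaily']].drop(columns='isdaily')
accumulated_clean = without_dupes[~without_dupes['isdaily']].drop(columns='isdaily')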
If I understood correctly, you need to have both tables separated.
You can concatenate them, keeping track of which table each row came from, and then recreate them:
Daily = pd.DataFrame({'col1':[1,2,3], 'col2':[2,3,4]})
Daily["Table"] = "Daily"
Accumulated = pd.DataFrame({'col1':[4,2,5], 'col2':[6,3,5]})
Accumulated["Table"] = "Accum"
df = pd.concat([Daily, Accumulated]).reset_index()
not_dup = df[["col1", "col2"]].drop_duplicates(keep=False)
not_dup = df.loc[not_dup.index,:]
Daily = not_dup[not_dup["Table"] == "Daily"][["col1","col2"]]
Accumulated = not_dup[not_dup["Table"] == "Accum"][["col1","col2"]]
print(Daily)
print(Accumulated)
Follow these steps:
Concatenate the 2 data-frames
Drop all duplicates
For each data-frame, find the intersection with the concatenated data-frame
Find the count with len
Daily = pd.DataFrame({'col1':[1,2,3], 'col2':[2,3,4]})
Accumulated = pd.DataFrame({'col1':[4,2,5], 'col2':[6,3,5]})
df = pd.concat([Daily, Accumulated]) # step 1
df = df.drop_duplicates(keep=False) # step 2
Daily = pd.merge(df, Daily, how='inner', on=['col1','col2']) #step 3
Accumulated = pd.merge(df, Accumulated, how='inner', on=['col1','col2']) #step 3
count = len(Daily) #step 4

Enumerate rows by category

I have the following dataframe that I'm ordering by category and values:
d = {"cat":["a","b","a","c","c"],"val" :[1,2,3,1,4] }
df = pd.DataFrame(d)
df = df.sort_values(["cat","val"])
Now, from that dataframe, I want to enumerate the occurrence of each category so that the result is as follows:
df["cat_count"] = [1,2,1,1,2]
Is there a way to automate this?
You can use cumcount like this (see the cumcount documentation for details):
df['count'] = df.groupby('cat').cumcount()+1
print (df)
Output
cat val count
0 a 1 1
2 a 3 2
1 b 2 1
3 c 1 1
4 c 4 2
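If you want the column named cat_count, as in the question, you can assign the same expression directly:
df['cat_count'] = df.groupby('cat').cumcount() + 1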

How to drop float values from a column - pandas

I have a dataframe as shown below:
df = pd.DataFrame({
'subject_id':[1,1,1,1,1,1],
'val' :[5,6.4,5.4,6,6,6]
})
I would like to drop the rows where the val column ends with .[1-9]. Basically, I would like to retain values like 5.0 and 6.0 and drop values like 5.4, 6.4, etc.
I tried the below, but it isn't accurate:
df['val'] = df['val'].astype(int)
df.drop_duplicates() # doesn't give the expected output
I expect my output to keep only the rows where val is a whole number.
The first idea is to compare the original values with the column cast to integer, and also assign the integers back for the expected output (integer column):
s = df['val']
df['val'] = df['val'].astype(int)
df = df[df['val'] == s]
print (df)
subject_id val
0 1 5
3 1 6
4 1 6
5 1 6
Another idea is to test is_integer:
mask = df['val'].apply(lambda x: x.is_integer())
df['val'] = df['val'].astype(int)
df = df[mask]
print (df)
subject_id val
0 1 5
3 1 6
4 1 6
5 1 6
If you need floats in the output, you can use:
df1 = df[ df['val'].astype(int) == df['val']]
print (df1)
subject_id val
0 1 5.0
3 1 6.0
4 1 6.0
5 1 6.0
Use mod 1 to determine the remainder. If the remainder is 0, the number is an integer. Then use the result as a mask to select only those rows.
df.loc[df.val.mod(1).eq(0)].astype(int)
subject_id val
0 1 5
3 1 6
4 1 6
5 1 6

Append data with one column to existing dataframe

I want to append a list of data to a dataframe such that the list will appear in a column, i.e.:
#Existing dataframe:
[A, 20150901, 20150902
1 4 5
4 2 7]
#list of data to append to column A:
data = [8,9,4]
#Required dataframe
[A, 20150901, 20150902
1 4 5
4 2 7
8 0 0
9 0 0
4 0 0]
I am using the following:
df_new = df.copy(deep=True)
#I am copying and deleting data as column names are type Timestamp and easier to reuse them
df_new.drop(df_new.index, inplace=True)
for item in data_list:
    df_new = df_new.append([{'A':item}], ignore_index=True)
df_new.fillna(0, inplace=True)
df = pd.concat([df, df_new], axis=0, ignore_index=True)
But doing this in a loop is inefficient, plus I get this warning:
Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.
Any ideas on how to overcome this error and append 2 dataframes in one go?
I think you need to concat a new DataFrame with column A, then reindex if you want the same order of columns, and last replace missing values with fillna:
data = [8,9,4]
df_new = pd.DataFrame({'A':data})
df = (pd.concat([df, df_new], ignore_index=True)
        .reindex(columns=df.columns)
        .fillna(0, downcast='infer'))
print (df)
A 20150901 20150902
0 1 4 5
1 4 2 7
2 8 0 0
3 9 0 0
4 4 0 0
I think you could do something like this:
df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
df2 = pd.DataFrame({'A':[8,9,4]})
df.append(df2).fillna(0)
A B
0 1 2.0
1 3 4.0
0 8 0.0
1 9 0.0
2 4 0.0
Maybe you can do it this way:
import numpy as np
new = pd.DataFrame(np.zeros((3, 3), dtype=int), columns=df.columns)  # create a new all-zero dataframe
new['A'] = [8, 9, 4]  # add the values for column A
df = df.append(new, ignore_index=True)  # and merge both dataframes

Overwrite columns in DataFrames of different sizes pandas

I have following two Data Frames:
df1 = pd.DataFrame({'ids':[1,2,3,4,5],'cost':[0,0,1,1,0]})
df2 = pd.DataFrame({'ids':[1,5],'cost':[1,4]})
And I want to update the values of df1 with the ones in df2 whenever there is a match in the ids. The desired dataframe is this one:
df_result = pd.DataFrame({'ids':[1,2,3,4,5],'cost':[1,0,1,1,4]})
How can I get that from the above two dataframes?
I have tried using merge, but it returns fewer records and keeps both cost columns:
results = pd.merge(df1,df2,on='ids')
results.to_dict()
{'cost_x': {0: 0, 1: 0}, 'cost_y': {0: 1, 1: 4}, 'ids': {0: 1, 1: 5}}
You could do this with a left merge:
merged = pd.merge(df1, df2, on='ids', how='left')
merged['cost'] = merged.cost_x.where(merged.cost_y.isnull(), merged['cost_y'])
result = merged[['ids','cost']]
However you can avoid the need for the merge (and get better performance) if you set the ids as an index column; then pandas can use this to align the results for you:
df1 = df1.set_index('ids')
df2 = df2.set_index('ids')
df1.cost.where(~df1.index.isin(df2.index), df2.cost)
ids
1 1.0
2 0.0
3 1.0
4 1.0
5 4.0
Name: cost, dtype: float64
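To write that result back into df1 and restore ids as a regular column, a small follow-up sketch under the same index-based setup:
df1['cost'] = df1['cost'].where(~df1.index.isin(df2.index), df2['cost'])
df1 = df1.reset_index()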
You can use set_index and combine_first to give precedence to the values in df2:
df_result = df2.set_index('ids').combine_first(df1.set_index('ids'))
df_result.reset_index()
You get
ids cost
0 1 1
1 2 0
2 3 1
3 4 1
4 5 4
Another way to do it is to use a temporary merged dataframe, which you can discard after use:
import pandas as pd
df1 = pd.DataFrame({'ids':[1,2,3,4,5],'cost':[0,0,1,1,0]})
df2 = pd.DataFrame({'ids':[1,5],'cost':[1,4]})
dftemp = df1.merge(df2,on='ids',how='left', suffixes=('','_r'))
print(dftemp)
df1.loc[~pd.isnull(dftemp.cost_r), 'cost'] = dftemp.loc[~pd.isnull(dftemp.cost_r), 'cost_r']
del dftemp
df1 = df1[['ids','cost']]
print(df1)
Output:
dftemp:
cost ids cost_r
0 0 1 1.0
1 0 2 NaN
2 1 3 NaN
3 1 4 NaN
4 0 5 4.0
df1:
ids cost
0 1 1.0
1 2 0.0
2 3 1.0
3 4 1.0
4 5 4.0
A little late, but this did it for me and was faster than the accepted answer in tests:
df1.update(df2.set_index('ids').reindex(df1.set_index('ids').index).reset_index())
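An arguably more readable equivalent, assuming the ids in df1 are unique (a sketch; not benchmarked against the accepted answer):
df1 = df1.set_index('ids')
df1.update(df2.set_index('ids'))   # overwrites cost for matching ids in place
df1 = df1.reset_index()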
