I am trying to loop over a data frame column which contains several lists and check whether the values in those lists are contained in another data frame's column.
I am pretty new to Python and have been stuck on this problem for a while. I already tried to solve it with isin and str.contains, but I still got no match.
Here is the code I worked out so far:
import pandas as pd

data = [['yellow', 10, 0], ['red', 15, 0], ['blue', 14, 0]]
df1 = pd.DataFrame(data, columns = ['Colour', 'Colour_id','Amount'])
df1
Colour Colour_id Amount
yellow 10 0
red 15 0
blue 14 0
data = [['tom',[10,15],200 ], ['adam', [10],50], ['john',[15,14],200]]
df2 = pd.DataFrame(data, columns=['Name', 'Colour_id', 'Amount'])
df2
Name Colour_id Amount
tom [10,15] 200
adam [10] 50
john [15,14] 200
for indices, row in df2.iterrows():
    for i in row['Colour_id']:
        if i in df1['Colour_id']:
            df1['Amount'] = df1['Amount'] = df2['Amount']
        else:
            print("No")
The expected result is that the Amount column of df1 is filled like this:
Colour Colour_id Amount
yellow 10 250
red 15 400
blue 14 200
At the moment I only get the "No" from the else branch.
The idea is to create a Series: convert the list column to a DataFrame, reshape it with stack, aggregate with sum, and then use Series.map:
# expand the lists into columns (with Amount as the index), then stack to get one row per list element
df3 = pd.DataFrame(df2['Colour_id'].values.tolist(), index=df2['Amount']).stack().reset_index()
# total Amount per colour id (the stacked values end up in column 0)
s = df3.groupby(0)['Amount'].sum()
df1['Amount'] = df1['Colour_id'].map(s)
print(df1)
Colour Colour_id Amount
0 yellow 10 250
1 red 15 400
2 blue 14 200
Or use defaultdict with pure Python to build the dictionary by summing values, then map for the new column:
from collections import defaultdict
d = defaultdict(int)
for cid, Amount in zip(df2['Colour_id'], df2['Amount']):
    for x in cid:
        d[x] += Amount
print(d)
defaultdict(<class 'int'>, {10: 250, 15: 400, 14: 200})
df1['Amount'] = df1['Colour_id'].map(d)
print(df1)
Colour Colour_id Amount
0 yellow 10 250
1 red 15 400
2 blue 14 200
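On pandas 0.25 or newer, a third option (not part of the original answer, so treat it as a sketch) is DataFrame.explode, which flattens the list column directly, using the same df1/df2 as above:
# assumes pandas >= 0.25, where DataFrame.explode is available
s = df2.explode('Colour_id').groupby('Colour_id')['Amount'].sum()
df1['Amount'] = df1['Colour_id'].map(s)
print(df1)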
Related
I have the following dataframe in Pandas:
name  value  in  out
A        50   1    0
A       -20   0    1
B       150   1    0
C        10   1    0
D       500   1    0
D      -250   0    1
E       800   1    0
There are at most 2 observations for each name: one for in and one for out.
If a name only has an in record, there is only one observation for it.
You can create this dataset with this code:
data = {
    'name': ['A','A','B','C','D','D','E'],
    'values': [50,-20,150,10,500,-250,800],
    'in': [1,0,1,1,1,0,1],
    'out': [0,1,0,0,0,1,0]
}
df = pd.DataFrame.from_dict(data)
I want to sum the values column for each name, but only if the name has both an in and an out record. In other words, only when a unique name has exactly 2 rows.
The result should look like this:
name  value
A        30
D       250
If I run the following code, I get all the results without the filtering based on in and out.
df.groupby('name').sum()
name  value
A        30
B       150
C        10
D       250
E       800
How can I add the aforementioned filtering based on the in and out columns?
Maybe you can try something with groupby, agg, and query (like below):
df.groupby('name').agg({'name':'count', 'values': 'sum'}).query('name>1')[['values']]
Output:
values
name
A 30
D 250
You could also use .query('name==2') above if you like, but since a name can occur at most twice, .query('name>1') returns the same result.
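On pandas 0.25+ the same idea can also be written with named aggregation, which avoids aggregating the grouping column itself (a sketch; the helper column name n is only illustrative):
out = (df.groupby('name')
         .agg(n=('values', 'size'), values=('values', 'sum'))
         .query('n == 2')[['values']])
print(out)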
IIUC, you could filter before aggregation:
# check that we have exactly 1 in and 1 out per group
mask = df.groupby('name')[['in', 'out']].transform('sum').eq([1,1]).all(1)
# slice the correct groups and aggregate
out = df[mask].groupby('name', as_index=False)['values'].sum()
Or, you could filter afterwards (maybe less efficient if you have a lot of groups that would be filtered out):
(df.groupby('name', as_index=False).sum()
   .loc[lambda d: d['in'].eq(1) & d['out'].eq(1), ['name', 'values']]
)
output:
name values
0 A 30
1 D 250
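Another readable (though usually slower) variant, a sketch based on the asker's "exactly 2 rows per name" phrasing, is GroupBy.filter:
out = (df.groupby('name')
         .filter(lambda g: len(g) == 2)  # keep names that have both an in and an out row
         .groupby('name', as_index=False)['values'].sum())
print(out)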
I am going to merge two datasets by 3 columns.
The hope is that there are no (or few) repeats of the 3-column groups in the original dataset. I would like to produce something that says approximately how unique each row is: maybe some kind of frequency plot (which might not work since I have a very large dataset), or a table that displays the average frequency for each 0.5 million rows, or something like that.
Is there a way to determine how unique each row is compared to the other rows?
1 2 3
A 100 B
A 200 B
A 300 B
For the above data frame, I would like to say that each row is unique.
1 2 3
A 200 B
A 200 B
A 100 B
For this data set, rows 1 and 2 are not unique. I don't want to drop one, but I am hoping to quantify/weight the number of non-unique rows.
The problem is that my dataframe is 14,000,000 rows long, so I need a way to show how unique each row is on a set this big.
Assuming you are using pandas, here's one possible way:
import pandas as pd
# Setup, which you can probably skip since you already have the data.
cols = ["1", "2", "3"]
rows = [
    ["A", 200, "B"],
    ["A", 200, "B"],
    ["A", 100, "B"],
]
df1 = pd.DataFrame(rows, columns=cols)
# Get focus column values before adding a new column.
key_columns = df1.columns.values.tolist()
# Add a line column
df1["line"] = 1
# Set new column to cumulative sum of line values.
df1["match_count"] = df1.groupby(key_columns )['line'].apply(lambda x: x.cumsum())
# Drop line column.
df1.drop("line", axis=1, inplace=True)
# Print results
print(df1)
Output -
1 2 3 match_count
0 A 200 B 1
1 A 200 B 2
2 A 100 B 1
Return only unique rows:
# We only want results where the count is less than 2,
# because we have our key columns saved, we can just return those
# and not worry about 'match_count'
df_unique = df1.loc[df1["match_count"] < 2, key_columns]
print(df_unique)
Output -
1 2 3
0 A 200 B
2 A 100 B
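If the goal is more an overall summary of how unique the 14-million-row frame is, rather than a per-row running count, a sketch reusing key_columns from above could aggregate group sizes instead:
# how many times each (1, 2, 3) combination occurs
sizes = df1.groupby(key_columns).size()
# distribution of group sizes: how many combinations appear once, twice, ...
print(sizes.value_counts().sort_index())
# per-row flag: True if the row's combination occurs only once
df1["is_unique"] = ~df1.duplicated(subset=key_columns, keep=False)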
I have a pandas dataframe which looks as follows:
df =
key value
1 Level 1
2 Age 35
3 Height 180
4 Gender 0
...
and a dictionary as follows:
my_dict = {
    'Level': {0: 'Low', 1: 'Medium', 2: 'High'},
    'Gender': {0: 'Female', 1: 'Male'}
}
I want to map from the dictionary to the dataframe and change the 'value' column to its corresponding value from the dictionary, such that the output becomes:
key value
1 Level Medium
2 Age 35
3 Height 180
4 Gender Female
...
It's okay for the other values in the column to become strings as well. How can I achieve this? Thanks for the help.
Check with replace
out = df.set_index('key').T.replace(my_dict).T.reset_index()
out
Out[27]:
key value
0 Level Medium
1 Age 35
2 Height 180
3 Gender Female
df.at[1, 'value'] = my_dict['Level'][df.at[1, 'value']]
df.at[4, 'value'] = my_dict['Gender'][df.at[4, 'value']]
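A more general row-wise variant (only a sketch, not tied to the two hard-coded row positions above) can do the dictionary lookups directly and fall back to the original value when a key such as 'Age' has no mapping:
df['value'] = [my_dict.get(k, {}).get(v, v) for k, v in zip(df['key'], df['value'])]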
I have been struggling with appending multiple DataFrames with varying columns and would really appreciate your help with this problem!
My original data set looks like below
df1 = height 10
color 25
weight 3
speed 33
df2 = height 51
color 25
weight 30
speed 33
df3 = height 51
color 25
speed 30
I call the transform_csv_data(csv_data, row) function to first add the name as the last row. Then I transpose and move the name, which becomes the last column, to the first column for every DataFrame, so each DataFrame looks like below before appending (but before moving the last column to the front):
df1 =
0 1 2 3 4
0 height color weight speed name
1 10 25 3 33 Joe
df2 =
0 1 2 3 4
0 height color weight speed name
1 51 25 30 33 Bob
df3 =
0 1 2 3
0 height color speed name
1 51 25 30 Chris
The problem is appending DataFrames with different numbers of columns, where each DataFrame contains two rows (header and data) as above.
The code for transform_csv_data helper function is shown below
def transform_csv_data(self, csv_data, row):
    df = pd.DataFrame(list(csv_data))
    df = df.iloc[:, [0, -2]]  # all rows with first and second last column
    df.loc[len(df)] = ['name', row]
    df = df.transpose()
    cols = df.columns.values.tolist()  # this returns index of each column
    cols.insert(0, cols.pop(-1))  # move last column to front
    df = df.reindex(columns=cols)
    return df
My main function for appending DataFrame is shown below
def aggregate_data(self, output_data_file_path):
    df_output = pd.DataFrame()
    rows = ['Joe', 'Bob', 'Chris']
    for index, row in enumerate(rows):
        csv_data = self.read_csv_url(row)
        df = self.transform_csv_data(csv_data, row)
        # ignore header unless first set of data is being processed
        if index != 0 or append:
            df = df[1:]
        df_output = df_output.append(df)
    df_output.to_csv(output_data_file_path, index=False, header=False, mode='a+')
I want my final appended DataFrame to look like below, but the format becomes weird as the name column goes back to the end:
final =
name height color weight speed
Joe 10 25 3 33
Bob 51 25 30 33
Chris 51 25 nan 30
How can I append all the DataFrames properly so the data goes into its corresponding column?
I have tried concat, merge, and df_output = df_output.append(df_row)[df_output.columns.tolist()], but no luck so far.
There are also duplicate columns which I would like to keep.
Thank you so much for your help
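A minimal sketch of the pd.concat direction mentioned in the question, assuming each per-person frame looks like the two-row frames shown earlier (first row = header, second row = values); to_named_frame is a hypothetical helper, not part of the original code:
import pandas as pd

def to_named_frame(df):
    out = df.copy()
    out.columns = out.iloc[0]   # promote the header row to column names
    return out.iloc[1:]         # keep only the data row

frames = [to_named_frame(d) for d in (df1, df2, df3)]
# concat aligns on column names and fills columns missing in a frame with NaN
final = pd.concat(frames, ignore_index=True, sort=False)
Column order (and any duplicate columns you want to keep) can then be adjusted as needed before writing to CSV.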
I have a big dataset (2m rows, 70 variables), which has many categorical variables. All categorical variables are coded as numbers (e.g. see df1).
df1:
obs gender job
1 1 1
2 1 2
3 2 2
4 1 1
I have another data frame with all the explanations, looking like this:
df2:
Var: Value: Label:
gender 1 male
gender 2 female
job 1 blue collar
job 2 white collar
Is there a fast way to replace all values of the categorical columns with their labels from df2? This would save me the work of always looking up the meaning of the value in df2. I found some solutions to replace values by hand, but I am looking for an automatic way of doing this.
Thank you
You could use a dictionary generated from df2. Like this:
Firstly, generating some dummy data:
import pandas as pd
import numpy as np
df1 = pd.DataFrame()
df1['obs'] = range(1,1001)
df1['gender'] = np.random.choice([1,2],1000)
df1['job'] = np.random.choice([1,2],1000)
df2 = pd.DataFrame()
df2['var'] = ['gender','gender','job','job']
df2['value'] = [1,2,1,2]
df2['label'] = ['male','female','blue collar', 'white collar']
If you want to replace just one variable, do something like this:
genderDict = dict(df2.loc[df2['var']=='gender'][['value','label']].values)
df1['gender_name'] = df1['gender'].apply(lambda x: genderDict[x])
And if you'd like to replace a bunch of variables:
colNames = list(df1.columns)
colNames.remove('obs')
for variable in colNames:
    varDict = dict(df2.loc[df2['var']==variable][['value','label']].values)
    df1[variable+'_name'] = df1[variable].apply(lambda x: varDict[x])
For a million rows it takes about 1 second, so it should be reasonably fast.
Create a mapper dictionary from df2 using groupby
d = df2.groupby('Var').apply(lambda x: dict(zip(x['Value'], x['Label']))).to_dict()
{'gender': {1: 'male', 2: 'female'},
'job': {1: 'blue collar', 2: 'white collar'}}
Now map the values in df1, using the outer key of the dictionary as the column name and the inner dictionary as the mapper:
for col in df1.columns:
    if col in d.keys():
        df1[col] = df1[col].map(d[col])
You get
obs gender job
0 1 male blue collar
1 2 male white collar
2 3 female white collar
3 4 male blue collar
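As a side note (not part of the original answer), the same nested dictionary d can also be passed straight to DataFrame.replace in a single call:
# nested {column: {old_value: new_label}} dict, applied column by column
df1_labeled = df1.replace(d)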