I have been struggling with appending multiple DataFrames with varying columns and would really appreciate your help with this problem!
My original data set looks like this:
df1 = height 10
color 25
weight 3
speed 33
df2 = height 51
color 25
weight 30
speed 33
df3 = height 51
color 25
speed 30
I call the transform_csv_data(csv_data, row) function to first add the name as the last row. Then I transpose and move the name, which becomes the last column, to the first column of every DataFrame. Each DataFrame looks like the following before appending (shown here before the last column is moved to the front):
df1 =
0 1 2 3 4
0 height color weight speed name
1 10 25 3 33 Joe
df2 =
0 1 2 3 4
0 height color weight speed name
1 51 25 30 33 Bob
df3 =
0 1 2 3
0 height color speed name
1 51 25 30 Chris
The problem is appending DataFrames with different numbers of columns, where each DataFrame contains two rows: a header row and a data row, as above.
The code for the transform_csv_data helper function is shown below:
def transform_csv_data(self, csv_data, row):
    df = pd.DataFrame(list(csv_data))
    df = df.iloc[:, [0, -2]]  # all rows with first and second last column
    df.loc[len(df)] = ['name', row]
    df = df.transpose()
    cols = df.columns.values.tolist()  # this returns index of each column
    cols.insert(0, cols.pop(-1))  # move last column to front
    df = df.reindex(columns=cols)
    return df
My main function for appending DataFrame is shown below
def aggregate_data(self, output_data_file_path):
    df_output = pd.DataFrame()
    rows = ['Joe', 'Bob', 'Chris']
    for index, row in enumerate(rows):
        csv_data = self.read_csv_url(row)
        df = self.transform_csv_data(csv_data, row)
        # ignore header unless first set of data is being processed
        if index != 0 or append:
            df = df[1:]
        df_output = df_output.append(df)
    df_output.to_csv(output_data_file_path, index=False, header=False, mode='a+')
I want my final appended DataFrame to look like the one below, but the format comes out wrong because the name column goes back to the end:
final =
name height color weight speed
Joe 10 25 3 33
Bob 51 25 30 33
Chris 51 25 nan 30
How can I append all the DataFrame properly so data is appended to its corresponding column?
I have tried concat, merge, and df_output = df_output.append(df_row)[df_output.columns.tolist()], but no luck so far.
There are also duplicate columns, which I would like to keep.
Thank you so much for your help.
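One possible approach, as a minimal sketch rather than a drop-in fix: give every per-person frame real column labels (instead of keeping the header as a data row) and let pd.concat align the columns by name. The raw dictionary and the to_record helper below are hypothetical stand-ins for read_csv_url and transform_csv_data, and the sketch assumes field names are unique within one person's data, so it does not cover the duplicate-column case mentioned above. pd.concat is used because DataFrame.append was removed in pandas 2.0.

```python
import pandas as pd

def to_record(pairs, name):
    """Build a one-row DataFrame from (field, value) pairs, with 'name' first."""
    record = {"name": name}
    record.update(dict(pairs))
    return pd.DataFrame([record])

# Hypothetical stand-in for the data returned by read_csv_url
raw = {
    "Joe": [("height", 10), ("color", 25), ("weight", 3), ("speed", 33)],
    "Bob": [("height", 51), ("color", 25), ("weight", 30), ("speed", 33)],
    "Chris": [("height", 51), ("color", 25), ("speed", 30)],
}

frames = [to_record(pairs, name) for name, pairs in raw.items()]

# Columns are aligned by label; fields missing for a person become NaN
final = pd.concat(frames, ignore_index=True, sort=False)
print(final)
```

Because the frames carry proper column labels, the header only needs to be written once, for example with final.to_csv(path, index=False).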
I am using the code below to search a .csv file, match on a column that exists in both files, and pull a different column into the first file as a new column. However, I would like to match on two columns instead of one. Is there a way to do this?
import pandas as pd
df1 = pd.read_csv("matchone.csv")
df2 = pd.read_csv("comingfrom.csv")
def lookup_prod(ip):
    for row in df2.itertuples():
        if ip in row[1]:
            return row[3]
    else:
        return '0'
df1['want'] = df1['name'].apply(lookup_prod)
df1[df1.want != '0']
print(df1)
#df1.to_csv('file_name.csv')
The code above matches on the column 'samename' in both files and takes the column I request ([3]) from df2. I want the code to match on both the 'name' column and another column, 'price', and take the value at [3] only if both columns match between df1 and df2.
df 1 :
name price value
a 10 35
b 10 21
c 10 33
d 10 20
e 10 88
df 2 :
name price want
a 10 123
b 5 222
c 10 944
d 10 104
e 5 213
When the code is run (asking for the want column from df2 whenever df1 name = df2 name), the result is:
name price value want
a 10 35 123
b 10 21 222
c 10 33 944
d 10 20 104
e 10 88 213
However, what I want is if both df1 name = df2 name and df1 price = df2 price, then take the column df2 want, so the desired result is:
name price value want
a 10 35 123
b 10 21 0
c 10 33 944
d 10 20 104
e 10 88 0
You need to use the pandas.DataFrame.merge() method with multiple keys:
df1.merge(df2, on=['name','price'], how='left').fillna(0)
merge represents missing values as NaN, so the column's dtype changes to float64, but you can change it back after filling the missing values with 0.
Also, please be aware that duplicated combinations of name and price in df2 will appear several times in the result.
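To illustrate the dtype point, here is a small sketch that fills the missing values and casts the column back to integers. The frames are rebuilt from the tables in the question, and result is just an illustrative variable name.

```python
import pandas as pd

df1 = pd.DataFrame({"name": ["a", "b", "c", "d", "e"],
                    "price": [10, 10, 10, 10, 10],
                    "value": [35, 21, 33, 20, 88]})
df2 = pd.DataFrame({"name": ["a", "b", "c", "d", "e"],
                    "price": [10, 5, 10, 10, 5],
                    "want": [123, 222, 944, 104, 213]})

# Left join on both keys; rows of df1 without a match get NaN in 'want'
result = df1.merge(df2, on=["name", "price"], how="left")

# Fill only the looked-up column, then restore an integer dtype
result["want"] = result["want"].fillna(0).astype(int)
print(result)
```

Filling only the want column, rather than the whole frame, leaves any genuine NaN values in other columns untouched.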
If you are matching the two dataframes based on the name and the price, you can use df.where and df.isin:
df1['want'] = df2['want'].where(df1[['name','price']].isin(df2).all(axis=1)).fillna('0')
df1
name price value want
0 a 10 35 123.0
1 b 10 21 0
2 c 10 33 944.0
3 d 10 20 104.0
4 e 10 88 0
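A caveat worth checking before relying on this: when isin is given a DataFrame, pandas compares cell by cell and requires both the index and the column labels to match, so it works here only because df1 and df2 list the names in the same row order. The sketch below uses df2_reordered, a hypothetical copy of df2 with its rows reversed and a fresh index, to show the mask changing even though the set of (name, price) pairs is the same.

```python
import pandas as pd

df1 = pd.DataFrame({"name": ["a", "b", "c", "d", "e"],
                    "price": [10, 10, 10, 10, 10],
                    "value": [35, 21, 33, 20, 88]})
df2 = pd.DataFrame({"name": ["a", "b", "c", "d", "e"],
                    "price": [10, 5, 10, 10, 5],
                    "want": [123, 222, 944, 104, 213]})

# Same rows as df2, but reversed and re-indexed 0..4 (hypothetical scenario)
df2_reordered = df2.iloc[::-1].reset_index(drop=True)

# isin(DataFrame) compares row 0 with row 0, row 1 with row 1, and so on
mask_same_order = df1[["name", "price"]].isin(df2).all(axis=1)
mask_reordered = df1[["name", "price"]].isin(df2_reordered).all(axis=1)

print(mask_same_order.tolist())
print(mask_reordered.tolist())
```

With the reordered frame only the middle row (c) is still flagged, although a and d also have matching pairs, so a key-based merge is the safer choice when row order is not guaranteed.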
Expanding on https://stackoverflow.com/a/73830294/20110802:
You can add the validate option to the merge in order to avoid duplication on one side (or both):
pd.merge(df1, df2, on=['name','price'], how='left', validate='1:1').fillna(0)
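As a quick illustration of what validate does, the sketch below feeds merge a right-hand frame with a duplicated key pair; the left and right frames are made-up examples, not the question's data. With validate='1:1' pandas raises pd.errors.MergeError instead of silently duplicating rows.

```python
import pandas as pd

left = pd.DataFrame({"name": ["a", "b"], "price": [10, 10]})
# The pair ('a', 10) appears twice on the right-hand side
right = pd.DataFrame({"name": ["a", "a"], "price": [10, 10], "want": [1, 2]})

raised = False
try:
    pd.merge(left, right, on=["name", "price"], how="left", validate="1:1")
except pd.errors.MergeError as exc:
    raised = True
    print("merge rejected:", exc)

# Without validate, the same merge quietly returns one row per match
unchecked = pd.merge(left, right, on=["name", "price"], how="left")
print(len(unchecked))
```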
Also, if the float conversion is a problem for you, one option is to do an inner join first and then pd.concat the result with the "leftover" rows of df1, to which you have already added a constant-valued column. It would look something like:
df_inner = pd.merge(df1, df2, on=['name', 'price'], how='inner', validate='1:1')
merged_pairs = set(zip(df_inner.name, df_inner.price))
# .copy() avoids a SettingWithCopyWarning when the new column is assigned
df_anti = df1.loc[~pd.Series(zip(df1.name, df1.price)).isin(merged_pairs)].copy()
df_anti['want'] = 0
df_result = pd.concat([df_inner, df_anti]) # perhaps ignore_index=True ?
It looks complicated, but it should be quite performant because it filters by set. There might be a way to set name and price as the index, merge on the index, and then filter by index to avoid the zip-set shenanigans, but I'm no expert on MultiIndex handling.
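For what it's worth, the index-based idea could be sketched roughly as follows. This is one possible reading of it, not a tested recipe: it assumes the (name, price) pairs in df2 are unique, since reindex refuses duplicate index labels, and the names keys, lookup and idx are illustrative only.

```python
import pandas as pd

df1 = pd.DataFrame({"name": ["a", "b", "c", "d", "e"],
                    "price": [10, 10, 10, 10, 10],
                    "value": [35, 21, 33, 20, 88]})
df2 = pd.DataFrame({"name": ["a", "b", "c", "d", "e"],
                    "price": [10, 5, 10, 10, 5],
                    "want": [123, 222, 944, 104, 213]})

keys = ["name", "price"]

# Series of 'want' values indexed by the (name, price) pair
lookup = df2.set_index(keys)["want"]

# The pairs we need to look up, in df1's row order
idx = pd.MultiIndex.from_frame(df1[keys])

# Missing pairs are filled with 0 directly, so no NaN step is needed
df1["want"] = lookup.reindex(idx, fill_value=0).to_numpy()
print(df1)
```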
#Try this code it will give you expected results
import pandas as pd
df1 = pd.DataFrame({'name' :['a','b','c','d','e'] ,
'price' :[10,10,10,10,10],
'value' : [35,21,33,20,88]})
df2 = pd.DataFrame({'name' :['a','b','c','d','e'] ,
'price' :[10,5,10,10,5],
'want' : [123,222,944,104 ,213]})
new = pd.merge(df1,df2, how='left', left_on=['name','price'], right_on=['name','price'])
print(new.fillna(0))
For every column whose name contains the string Time, I would like to overwrite that column, keeping the same name. For each item of Pax_cols (there may be more than one), I want to replace the column with its sum with the column Temp.
data={'Run_Time':[60,20,30,45,70,100],'Temp':[10,20,30,50,60,100], 'Rest_Time':[5,5,5,5,5,5]}
df=pd.DataFrame(data)
Pax_cols = [col for col in df.columns if 'Time' in col]
df[Pax_cols[0]]= df[Pax_cols[0]] + df["Temp"]
This is what I came up with for the case where Pax_cols has only one value, but it does not work.
Expected output:
data={'Run_Time':[70,40,60,95,130,200],'Temp':[10,20,30,50,60,100], 'Rest_Time':[15,25,35,55,65,105]}
You can use:
# get columns with "Time" in the name
cols = list(df.filter(like='Time'))
# ['Run_Time', 'Rest_Time']
# add the value of df['Temp']
df[cols] = df[cols].add(df['Temp'], axis=0)
output:
Run_Time Temp Rest_Time
0 70 10 15
1 40 20 25
2 60 30 35
3 95 50 55
4 130 60 65
5 200 100 105
I have a column that looks like this:
Age
[0-10)
[10-20)
[20-30)
[30-40)
[40-50)
[50-60)
[60-70)
[70-80)
and want to remove the "[", "-" and ")". Instead of showing a range such as 0-10, I would like to show the middle value for every row in the column.
Yet another solution:
The dataframe:
df = pd.DataFrame({'Age':['[0-10)','[10-20)','[20-30)','[30-40)','[40-50)','[50-60)','[60-70)','[70-80)']})
df
Age
0 [0-10)
1 [10-20)
2 [20-30)
3 [30-40)
4 [40-50)
5 [50-60)
6 [60-70)
7 [70-80)
The code:
df['Age'] = df.Age.str.extract(r'(\d+)-(\d+)').astype('int').mean(axis=1).astype('int')
The result:
df
Age
0 5
1 15
2 25
3 35
4 45
5 55
6 65
7 75
If you want to explode a row into multiple rows where each row carries a value from the range, you can do this:
data = '''[0-10)
[10-20)
[20-30)
[30-40)
[40-50)
[50-60)
[60-70)
[70-80)'''
df = pd.DataFrame({'Age': data.splitlines()})
df['Age'] = df['Age'].str.extract(r'\[(\d+)-(\d+)\)').astype(int).apply(lambda r: list(range(r[0], r[1])), axis=1)
df.explode('Age')
Note that I assume your Age column is string typed, so I used extract to get the boundaries of the ranges and converted them to a real list of integers. Finally, exploding the dataframe on the modified Age column gives you a new row for each integer in the list. Values in other columns are copied accordingly.
I tried this:
import pandas as pd
import re
data = {
'age_range': [
'[0-10)',
'[10-20)',
'[20-30)',
'[30-40)',
'[40-50)',
'[50-60)',
'[60-70)',
'[70-80)',
]
}
df = pd.DataFrame(data=data)
def get_middle_age(age_range):
    pattern = r'(\d+)'
    ages = re.findall(pattern, age_range)
    return int((int(ages[0])+int(ages[1]))/2)
df['age'] = df.apply(lambda row: get_middle_age(row['age_range']), axis=1)
What's the best practice for such a simple task?
Is there any way to speed up the messy lines of code with a simple method?
Let's say we have the following raw data:
#Example Data
df = pd.read_csv('example.csv', header = None, sep=';')
display(df)
To manipulate the data appropriately we transpose and edit:
#Transpose the Dataframe
df = df.transpose()
display(df)
#Set the first row as a header
header_row = 0 #<-- Input, should be mandatory? could it be default?
#Set the column names to the first row values
df.columns = df.iloc[header_row]
#Get rid of the unnecessary row
df = df.drop(header_row).reset_index(drop=True)
#Display final result
display(df)
You can simply do that with:
df.set_index(0,inplace=True)
df.T
An example of the same kind is given below:
df2=pd.DataFrame({"Col":[9,8,0],"Row":[6,4,22],"id":[26,55,27]})
df2
Col Row id
0 9 6 26
1 8 4 55
2 0 22 27
df2.set_index("id",inplace=True)
df2.T
id 26 55 27
Col 9 8 0
Row 6 4 22
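Applied to a headerless, semicolon-separated file like the one in the question, the same idea might look like the sketch below. The inline raw string is a made-up stand-in for example.csv, whose real contents are not shown, and tidy is an illustrative name.

```python
import io
import pandas as pd

# Hypothetical stand-in for example.csv: one field per line, labels in column 0
raw = "height;10;51\ncolor;25;25\n"
df = pd.read_csv(io.StringIO(raw), header=None, sep=";")

# Use column 0 as the index, transpose so its values become the column names,
# then renumber the rows from 0
tidy = df.set_index(0).T.reset_index(drop=True)
tidy.columns.name = None  # drop the leftover '0' label on the column axis
print(tidy)
```

This replaces the transpose, header assignment and drop steps with a single chained expression.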
Is there a way to get data from one dataframe (using, for example, loc) based on a condition and, if it is true, retrieve more than one value (i.e. from different columns) in that row and append them to another dataframe as a new row (e.g. at location index+1), storing each value in a column whose name differs from the original dataframe's column names? And without using loops.
I originally accomplished this by looping through the dataframe and appending the desired data to the new dataframe, but I can't use a loop.
What I have:
data.columns = […,"Age", …., "Pclass"] # columns needed: "Age" and "Pclass"
pas12to19.columns = ["AGE_12", "AGE_TEEN", "PCLASS"]
Add ages 12 to new df and ages 13-19 to new df
df12 = data.loc[((data['Age'] == 12)), ['Age', 'Pclass']]
df13to19 = data.loc[((data['Age'] < 20) & (data['Age'] > 12)), ['Age', 'Pclass']]
store in pas12to19 df
pas12to19 = pas12to19.append(df12)
pas12to19 = pas12to19.append(df13to19)
This method adds the column names from the original dataframe to the new dataframe.
I would then need to filter over pas12to19 dataframe and move the values in one column to their appropriate column then delete the columns Age and Pclass that came from the original dataframe. I would think there is a better approach.
What I would like to do (but isn't allowed):
pas12to19["AGE_TEEN", "PCLASS"] = data.loc[((data['Age'] < 20) & (data['Age'] > 12)), ['Age', 'Pclass']]
I.e. Store Age in AGE_TEEN and store Pclass in PCLASS.
Original df table Data
... Pclass Age ...
3 17
1 15
3 13
1 19
2 15
New and expected df table pas12to19
AGE_12 AGE_TEEN PCLASS
NaN 17 3
NaN 15 1
NaN 13 2
12 NaN 1
NaN 15 1
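One loop-free way to get this shape, sketched under assumptions: select the rows once, then build the new frame column by column with Series.where, which keeps a value where the condition holds and puts NaN elsewhere. The sample data below is hypothetical (the five rows from the table plus one made-up passenger aged 12), and the column mapping follows the description above (Age into AGE_12 or AGE_TEEN, Pclass into PCLASS).

```python
import pandas as pd

# Hypothetical sample: the rows shown above plus one 12-year-old
data = pd.DataFrame({"Pclass": [3, 1, 3, 1, 2, 1],
                     "Age": [17, 15, 13, 19, 15, 12]})

# Keep only passengers aged 12 to 19 (both ends inclusive)
teens = data.loc[data["Age"].between(12, 19), ["Age", "Pclass"]]

pas12to19 = pd.DataFrame({
    # Age goes to AGE_12 only for 12-year-olds, NaN otherwise
    "AGE_12": teens["Age"].where(teens["Age"] == 12),
    # Age goes to AGE_TEEN only for ages 13-19, NaN otherwise
    "AGE_TEEN": teens["Age"].where(teens["Age"] > 12),
    "PCLASS": teens["Pclass"],
}).reset_index(drop=True)
print(pas12to19)
```

Since the new column names are assigned while the frame is built, nothing from the original Age and Pclass columns has to be moved or deleted afterwards.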