add rows even if columns mismatch - python

I have an Excel sheet, loaded into a DataFrame, whose tail() looks like this:
ix date Type Value1 Value2 Value3
-------------------------------------------
651 01.02.2021 A 105 135 81
652 01.02.2021 B 3 10 1
653 01.02.2021 C 108 145 82
I have another DataFrame that instead looks like this:
0 02.02.2021 02.02.2021 02.02.2021
1 A B C
Value1 110 4 114
Val2 142 15 157
Value3 96 2 98
I want to append this latter DataFrame, transposed, to the end of the first.
I have tried both append() and pd.concat(), but since the column names do not always match (Value2 != Val2), the values in the resulting columns end up being NaN.

If the first dataframe is df1 and the second is df2:
First transpose df2 and reset the index:
df3 = df2.T.reset_index()
If the dataframe df2 is always of the same form, you can simply overwrite the column names:
df3.columns = df1.columns
And then concat:
df = pd.concat([df1, df3], axis=0)
If the order of df2 is not always the same and the misspellings can vary, you'll have to identify all possible misspellings first and, for instance, keep them in a dictionary like so:
mapping = {"Value1": "Value1", "Value2": "Value2", "Value3": "Value3", "Val2": "Value2"}
Then, assuming the value strings are in the index of df2, you overwrite the index:
df2.index = df2.index.map(mapping)
Afterwards you can perform the steps described above.
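Putting it together, a minimal runnable sketch of those steps, assuming df2 is laid out exactly as in the question (dates and types in the first two rows, value names in the index), so the transposed index carries no useful information and can be dropped:

import pandas as pd

# df1: the main table from the question (651-653 are the existing index labels)
df1 = pd.DataFrame(
    {"date": ["01.02.2021", "01.02.2021", "01.02.2021"],
     "Type": ["A", "B", "C"],
     "Value1": [105, 3, 108],
     "Value2": [135, 10, 145],
     "Value3": [81, 1, 82]},
    index=[651, 652, 653],
)

# df2: dates and types as the first two rows, value names in the index,
# including the misspelled "Val2"
df2 = pd.DataFrame(
    [["02.02.2021", "02.02.2021", "02.02.2021"],
     ["A", "B", "C"],
     [110, 4, 114],
     [142, 15, 157],
     [96, 2, 98]],
    index=[0, 1, "Value1", "Val2", "Value3"],
)

df3 = df2.T.reset_index(drop=True)   # one row per type; drop the uninformative index
df3.columns = df1.columns            # overwrite the mismatched column names
df = pd.concat([df1, df3], axis=0, ignore_index=True)
print(df.tail())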

Related

Pandas: Search and match based on two conditions

I am using the code below to search a .csv file, matching a column in both files and grabbing a different column I want, which I then add as a new column. However, I am trying to make the match based on two columns instead of one. Is there a way to do this?
import pandas as pd

df1 = pd.read_csv("matchone.csv")
df2 = pd.read_csv("comingfrom.csv")

def lookup_prod(ip):
    for row in df2.itertuples():
        if ip in row[1]:
            return row[3]
    else:
        return '0'

df1['want'] = df1['name'].apply(lookup_prod)
df1[df1.want != '0']
print(df1)
#df1.to_csv('file_name.csv')
The code above searches on the shared column 'samename' in both files and gets the column I request ([3]) from df2. I want the code to match on both the 'name' column and another column, 'price', and take the value in ([3]) only if both columns match between df1 and df2.
df1:
name price value
a 10 35
b 10 21
c 10 33
d 10 20
e 10 88
df2:
name price want
a 10 123
b 5 222
c 10 944
d 10 104
e 5 213
When the code is run (asking for the want column from df2, matching only on df1 name = df2 name), the produced result is:
name price value want
a 10 35 123
b 10 21 222
c 10 33 944
d 10 20 104
e 10 88 213
However, what I want is if both df1 name = df2 name and df1 price = df2 price, then take the column df2 want, so the desired result is:
name price value want
a 10 35 123
b 10 21 0
c 10 33 944
d 10 20 104
e 10 88 0
You need to use the pandas.DataFrame.merge() method with multiple keys:
df1.merge(df2, on=['name', 'price'], how='left').fillna(0)
The method represents missing values as NaN, so the want column's dtype changes to float64, but you can change it back after filling the missing values with 0.
Also please be aware that duplicated combinations of name and price in df2 will appear several times in the result.
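As a rough sketch of that cleanup step (the cast back to int is an assumption about the dtype you want):

out = df1.merge(df2, on=['name', 'price'], how='left')
out['want'] = out['want'].fillna(0).astype(int)   # NaN -> 0, then back from float64 to int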
If you are matching the two dataframes based on the name and the price, you can use df.where and df.isin:
df1['want'] = df2['want'].where(df1[['name','price']].isin(df2).all(axis=1)).fillna('0')
df1
name price value want
0 a 10 35 123.0
1 b 10 21 0
2 c 10 33 944.0
3 d 10 20 104.0
4 e 10 88 0
Expanding on https://stackoverflow.com/a/73830294/20110802:
You can add the validate option to the merge in order to avoid duplication on one side (or both):
pd.merge(df1, df2, on=['name','price'], how='left', validate='1:1').fillna(0)
Also, if the float conversion is a problem for you, one option is to do an inner join first and then pd.concat the result with the "leftover" part of df1, to which you have already added a constant-valued column. It would look something like:
df_inner = pd.merge(df1, df2, on=['name', 'price'], how='inner', validate='1:1')
merged_pairs = set(zip(df_inner.name, df_inner.price))
df_anti = df1.loc[~pd.Series(zip(df1.name, df1.price)).isin(merged_pairs)]
df_anti['want'] = 0
df_result = pd.concat([df_inner, df_anti]) # perhaps ignore_index=True ?
It looks complicated, but it should be quite performant because it filters by set. It might also be possible to set name and price as the index, merge on the index, and then filter by index to avoid the zip-set shenanigans, but I'm no expert on MultiIndex handling.
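For what it's worth, one possible shape of that index-based variant (an untested sketch, assuming the name/price pairs in df2 are unique):

keyed = df2.set_index(['name', 'price'])                     # MultiIndex on the two key columns
df_result = df1.join(keyed['want'], on=['name', 'price'])    # left join against the MultiIndex
df_result['want'] = df_result['want'].fillna(0).astype(int)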
Try this code, it will give you the expected results:
import pandas as pd

df1 = pd.DataFrame({'name': ['a', 'b', 'c', 'd', 'e'],
                    'price': [10, 10, 10, 10, 10],
                    'value': [35, 21, 33, 20, 88]})
df2 = pd.DataFrame({'name': ['a', 'b', 'c', 'd', 'e'],
                    'price': [10, 5, 10, 10, 5],
                    'want': [123, 222, 944, 104, 213]})

new = pd.merge(df1, df2, how='left', left_on=['name', 'price'], right_on=['name', 'price'])
print(new.fillna(0))

Match one column with another data frame column and paste value from second data - Python

I have two data frames: one contains the data, the second contains codes and their decoded values. I want to match df1[code] with df2[code] and paste df2[value] into df1.
It should be noted that my second data frame contains each code and its value only once; it's basically a lookup sheet of codes and values. In the first data frame the codes repeat, so the pasted values column should show the value every time a code appears in the df1[code] column.
df1[code] df2[code] df2[value]
234       000       Three
235       234       Two
238       238       Four
337       235       Five
I need the following:
df1[code] df1[value]
234       Two
235       Five
238       Four
337       Null
Basically, this is a translation of the codes in one data frame using the second data frame.
Suppose that your dataframes are the following ones:
df1
code something some_number
0 210 SOMETHING_28 0.206017
1 913 SOMETHING_36 0.810195
2 210 SOMETHING_18 0.258638
3 None a 0.000000
df2
code value
0 210 VALUE_01
1 590 VALUE_02
2 614 VALUE_03
3 696 VALUE_04
4 913 VALUE_05
Then, you can use merge, changing the type of the code column, if needed (e.g., if it is a string):
df1.code = df1.code.map(lambda x: np.int64(x) if x else np.nan).astype('Int64')
df2.code = df2.code.astype('Int64')
merged_df = df1.merge(df2, on='code', how='left')
And you get:
code value
0 210 VALUE_01
1 913 VALUE_05
2 210 VALUE_01
3 <NA> NaN
Here is the code to create df1 and df2 with the same structure as the ones shown in this answer:
import pandas as pd
import numpy as np

codes = sorted(np.random.randint(1, 1000, 5))
values = [f'VALUE_{x:02.0f}' for x in range(1, len(codes) + 1)]
df1 = pd.DataFrame(
    data=[
        [c, f'SOMETHING_{np.random.randint(1, 50)}', np.random.random()]
        for c in np.random.choice(codes, 3)
    ],
    columns=['code', 'something', 'some_number']
)
df2 = pd.DataFrame(
    data=list(zip(codes, values)),
    columns=['code', 'value']
)
How about using a map dict:
map_dict = dict(zip(df2['code'], df2['value']))
df1['value'] = df1['code'].map(map_dict)
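Applied to the sample codes from the question, that looks roughly like this (codes kept as strings so that '000' survives; unmatched codes come out as NaN, i.e. the Null in the desired output):

import pandas as pd

df1 = pd.DataFrame({'code': ['234', '235', '238', '337']})
df2 = pd.DataFrame({'code': ['000', '234', '238', '235'],
                    'value': ['Three', 'Two', 'Four', 'Five']})

map_dict = dict(zip(df2['code'], df2['value']))
df1['value'] = df1['code'].map(map_dict)
print(df1)
#   code value
# 0  234   Two
# 1  235  Five
# 2  238  Four
# 3  337   NaN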

Joining 101 columns from a dictionary of dataframes

For the love of God! I have 101 single column features and I just want to join, or merge, or concatenate them so they all have the index of the first frame. I have all the frames in a dict already! I thought that would be the hard part.
Below I've done manually what I'd like to do. What I'd like to do is loop through the dict and get all 101 columns.
a=ddict['/Users/cb/Dropbox/Python Projects/Machine Learning/Data Series/Full Individual Stock Data/byd/1byd.xls']
b=ddict['/Users/cb/Dropbox/Python Projects/Machine Learning/Data Series/Full Individual Stock Data/byd/2byd.xls']
c=ddict['/Users/cb/Dropbox/Python Projects/Machine Learning/Data Series/Full Individual Stock Data/byd/3byd.xls']
d=a.join(b['Value'],lsuffix='_caller')
f=d.join(c['Value'],lsuffix='_caller')
f
You will need to:
1. Create a first variable and set it to True. The first time we iterate through our dict() we don't have anything to merge our dataframe with, so we just assign the value to a variable.
2. Set the first variable to False so that next time we merge our dataframes together.
3. Call df.merge() and set the left_index and right_index parameters to True so that the join happens on these indexes.
Below is a sample code.
Input
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4]})
df1 = pd.DataFrame({'col2': [11, 12, 13, 14]})
df2 = pd.DataFrame({'col3': [111, 112, 113, 114]})
d = {'df': df, 'df1': df1, 'df2': df2}

first = True
for key, value in d.items():
    if first:
        n = value
        first = False
    else:
        n = n.merge(value, left_index=True, right_index=True)

n.head()
Output
col1 col2 col3
0 1 11 111
1 2 12 112
2 3 13 113
3 4 14 114
Here is a link to merge() for more information.
I would like to add that, if you want to keep the keys of the dictionary as the column headers of the final dataframe, you just need to add this at the end:
n.columns=d.keys()
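Continuing the sample above, a short sketch of that renaming:

n.columns = list(d.keys())   # dict keys become the column headers
print(n.head())
#    df  df1  df2
# 0   1   11  111
# 1   2   12  112
# 2   3   13  113
# 3   4   14  114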

drop rows that have duplicated indices

I have a DataFrame where each observation is identified by an index. However, for some indices the DF contains several observations, only one of which has the most up-to-date data. I would like to drop the outdated duplicated rows based on the values in some of the columns.
For example, in the following DataFrame, how can I drop the first and third rows with index = 122?
index col1 col2
122 - -
122 one two
122 - two
123 four one
124 five -
That is, I would like to get a final DF like this:
index col1 col2
122 one two
123 four one
124 five -
This seems to be a very common problem when we get data through several different retrievals over time. But I cannot figure out an efficient way of cleaning the data.
You could use groupby/transform to create a boolean mask which is True where the group count is greater than 1 and any of the values in the row equals '-'. Then you could use df.loc[~mask] to select the unmasked rows of df:
import pandas as pd
df = pd.read_table('data', sep=r'\s+')
count = df.groupby(['index'])['col1'].transform('count') > 1
mask = (df['col1'] == '-') | (df['col2'] == '-')
mask = mask & count
result = df.loc[~mask]
print(result)
yields
index col1 col2
0 122 one two
1 123 four one
2 124 five -
If the index is already a column then you can call drop_duplicates and pass keep='last':
In [14]:
df.drop_duplicates('index', keep='last')
Out[14]:
index col1 col2
1 122 - two
2 123 four one
If it's actually your index, then you'd be better off calling reset_index first, performing the above step, and then setting the index back again.
There is a drop_duplicates method on Index, but this just removes duplicates from the index; the returned de-duplicated index does not let you select the corresponding rows of the df, so I recommend the above approach of calling drop_duplicates on the df itself.
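A rough sketch of that round trip, assuming the duplicated field really is the DataFrame's index and ends up as a column named 'index' after resetting:

result = (df.reset_index()
            .drop_duplicates('index', keep='last')   # keep the last (most recent) row per index value
            .set_index('index'))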
EDIT
Based on your new information, the easiest may be to replace the outdated data with NaN values and drop those rows:
In [36]:
df.replace('-', np.NaN).dropna()
Out[36]:
col1 col2
index
122 one two
123 four one
Another Edit
What you could do is groupby the index and take the first values of the remaining columns, then call reset_index:
In [56]:
df.groupby('index')[['col1', 'col2']].first().reset_index()
Out[56]:
index col1 col2
0 122 - -
1 123 four one
2 124 five -

Pandas: join 'on' failing

I have two DataFrames, df1:
ID value 1
0 5 162
1 7 185
2 11 156
and df2:
ID Comment
1 5
2 7 Yes!
6 11
... which I want to join using ID, with a result that looks like this:
ID value 1 Comment
5 162
7 185 Yes!
11 156
The real DataFrames are much larger and contain more columns, and I essentially want to add the Comment column from df2 to df1. I tried using
df1 = df1.join(df2['Comment'], on='ID')
... but that only gets me a new empty Comment column in df1, like .join somehow fails to use the ID column as the index. I have also tried
df1 = df1.join(df2['Comment'])
... but that uses the default indexes, which don't match between the two DataFrames (they also have different lengths), giving me a Comment value on the wrong place.
What am I doing wrong?
You can just do a merge to achieve what you want:
In [30]:
df1.merge(df2, on='ID')
Out[30]:
ID value1 Comment
0 5 162 None
1 7 185 Yes!
2 11 156 None
[3 rows x 3 columns]
The problem with join is that by default it performs a left index join; because your dataframes do not have common index values that match, your Comment column ends up empty.
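In other words, join can work here if the right-hand frame is indexed by ID first; a minimal sketch of that fix:

df1 = df1.join(df2.set_index('ID')['Comment'], on='ID')   # left join on df1's ID column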
EDIT
Following on from the comments, if you want to retain all values in df1 and add just the comments that are not empty and have IDs that exist in df1, then you can perform a left merge:
df1.merge(df2.dropna(subset=['Comment']), on='ID', how='left')
This will drop any rows with empty comments, use the ID column to merge df1 and df2, and perform a left merge so that all values on the left-hand side are retained while matching comments are merged in by ID. The default is an inner merge, which retains only IDs that are present in both the left and right dfs.
Further information on merge and further examples can be found in the pandas documentation.
