I have 2 dataframes, one of which has supplemental information for some (but not all) of the rows in the other.
names = df({'names':['bob','frank','james','tim','ricardo','mike','mark','joan','joe'],
            'position':['dev','dev','dev','sys','sys','sys','sup','sup','sup']})
info = df({'names':['joe','mark','tim','frank'],
           'classification':['thief','thief','good','thief']})
I would like to take the classification column from the info dataframe above and add it to the names dataframe above. However, when I do combined = pd.merge(names, info) the resulting dataframe is only 4 rows long. All of the rows that do not have supplemental info are dropped.
Ideally, I would have the values in those missing columns set to unknown, resulting in a dataframe where some people are thieves, some are good, and the rest are unknown.
EDIT:
One of the first answers I received suggested using an outer merge, which seems to do some weird things. Here is a code sample:
names = df({'names':['bob','frank','bob','bob','bob''james','tim','ricardo','mike','mark','joan','joe'],
            'position':['dev','dev','dev','dev','dev','dev''sys','sys','sys','sup','sup','sup']})
info = df({'names':['joe','mark','tim','frank','joe','bill'],
           'classification':['thief','thief','good','thief','good','thief']})
what = pd.merge(names, info, how="outer")
what.fillna("unknown")
The strange thing is that in the output I'll get a row where the resulting name is "bobjames" and another where the position is "devsys". Finally, even though bill does not appear in the names dataframe, it shows up in the resulting dataframe. So I really need a way to say: look up a value in this other dataframe and, if you find something, tack those columns on.
In case you are still looking for an answer for this:
The "strange" things that you described are due to some minor errors in your code. For example, the first (appearance of "bobjames" and "devsys") is due to the fact that you don't have a comma between those two values in your source dataframes. And the second is because pandas doesn't care about the name of your dataframe but cares about the name of your columns when merging (you have a dataframe called "names" but also your columns are called "names"). Otherwise, it seems that the merge is doing exactly what you are looking for:
import pandas as pd
names = pd.DataFrame({'names':['bob','frank','bob','bob','bob', 'james','tim','ricardo','mike','mark','joan','joe'],
'position':['dev','dev','dev','dev','dev','dev', 'sys','sys','sys','sup','sup','sup']})
info = pd.DataFrame({'names':['joe','mark','tim','frank','joe','bill'],
'classification':['thief','thief','good','thief','good','thief']})
what = pd.merge(names, info, how="outer")
what.fillna('unknown', inplace=True)
which will result in:
names position classification
0 bob dev unknown
1 bob dev unknown
2 bob dev unknown
3 bob dev unknown
4 frank dev thief
5 james dev unknown
6 tim sys good
7 ricardo sys unknown
8 mike sys unknown
9 mark sup thief
10 joan sup unknown
11 joe sup thief
12 joe sup good
13 bill unknown thief
I think you want to perform an outer merge:
In [60]:
pd.merge(names, info, how='outer')
Out[60]:
names position classification
0 bob dev NaN
1 frank dev thief
2 james dev NaN
3 tim sys good
4 ricardo sys NaN
5 mike sys NaN
6 mark sup thief
7 joan sup NaN
8 joe sup thief
There is a section in the docs showing the types of merges you can perform: http://pandas.pydata.org/pandas-docs/stable/merging.html#database-style-dataframe-joining-merging
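If you also want the 'unknown' fill from the question while keeping only the people listed in names, a left merge should do it (a minimal sketch building on the frames above):
pd.merge(names, info, how='left').fillna('unknown')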
The join function can also be used for an outer or inner join. In the case above, let's suppose that names is the main table (all rows from this table must occur in the result). Then, to run a left outer join, use:
what = names.set_index('names').join(info.set_index('names'), how='left')
or, with the 'unknown' fill applied:
what = names.set_index('names').join(info.set_index('names'), how='left').fillna("unknown")
The set_index calls are used to create a temporary index column (the same one in both tables). If the dataframes already contained such an index column, this step wouldn't be necessary. For example:
# define the index when creating the dataframes
names = pd.DataFrame({'names':['bob',...],'position':['dev',...]}).set_index('names')
info = pd.DataFrame({'names':['joe',...],'classification':['thief',...]}).set_index('names')
what = names.join(info, how='left')
To perform other types of join, just change the how argument (left/right/inner/outer are allowed). More info is in the pandas merging documentation linked above.
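For example, an inner join on those indexed frames would keep only the people who actually have a classification:
what = names.join(info, how='inner')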
Think of it as an SQL join operation. You need a left-outer join[1].
names = pd.DataFrame({'names':['bob','frank','james','tim','ricardo','mike','mark','joan','joe'],'position':['dev','dev','dev','sys','sys','sys','sup','sup','sup']})
info = pd.DataFrame({'names':['joe','mark','tim','frank'],'classification':['thief','thief','good','thief']})
Since there are names for which there is no classification, a left-outer join will do the job.
a = pd.merge(names, info, how='left', on='names')
The result is ...
>>> a
names position classification
0 bob dev NaN
1 frank dev thief
2 james dev NaN
3 tim sys good
4 ricardo sys NaN
5 mike sys NaN
6 mark sup thief
7 joan sup NaN
8 joe sup thief
... which is fine. All the NaN results are ok if you look at both the tables.
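If you also want the literal 'unknown' from the question instead of NaN, a fillna on top of that result should do it:
a = a.fillna('unknown')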
Cheers!
[1] - http://pandas.pydata.org/pandas-docs/stable/merging.html#database-style-dataframe-joining-merging
Related
I am trying to collapse all the rows of a dataframe into one single row across all columns.
My data frame looks like the following:
name  job       value
bob   business  100
NaN   dentist   NaN
jack  NaN       NaN
I am trying to get the following output:
name      job                value
bob jack  business dentist   100
I am trying to group across all columns; I do not care if the value column is converted to dtype object (string).
I'm just trying to collapse all the rows across all columns.
I've tried groupby(index=0) but did not get good results.
You could apply join:
out = df.apply(lambda x: ' '.join(x.dropna().astype(str))).to_frame().T
Output:
name job value
0 bob jack business dentist 100.0
Try this:
new_df = df.agg(lambda x: x.dropna().astype(str).tolist()).str.join(' ').to_frame().T
Output:
>>> new_df
name job value
0 bob jack business dentist 100.0
Every month I collect data that contains details of employees to be stored in our database.
I need to find a solution to compare the data stored in the previous month to the data received and, for each row that any of the columns had a change, it would return into a new dataframe.
I would also need to know somehow which columns in each row of this new returned dataframe had a change when this comparison happened.
There are also some important details to mention:
Each column can also contain blank values in any of the dataframes;
The dataframes have the same column names but not necessarily the same data type;
The dataframes do not have the same number of rows necessarily;
If a row does not find its Index match, do not return it in the new dataframe;
The rows of the dataframes can be matched by a column named "Index"
So, for example, we would have this dataframe (which is just a slice of the real one as it has 63 columns):
df1:
Index Department Salary Manager Email Start_Date
1 IT 6000.00 Jack ax#i.com 01-01-2021
2 HR 7000 O'Donnel ay#i.com
3 MKT $7600 Maria d 30-06-2021
4 I'T 8000 Peter az#i.com 14-07-2021
df2:
Index Department Salary Manager Email Start_Date
1 IT 6000.00 Jack ax#i.com 01-01-2021
2 HR 7000 O'Donnel ay#i.com 01-01-2021
3 MKT 7600 Maria dy#i.com 30-06-2021
4 IT 8000 Peter az#i.com 14-07-2021
5 IT 9000 John NOT PROVIDED
6 IT 9900 John NOT PROVIDED
df3:
Index Department Salary Manager Email Start_Date
2 HR 7000 O'Donnel ay#i.com 01-01-2021
3 MKT 7600 Maria dy#i.com 30-06-2021
4 IT 8000 Peter az#i.com 14-07-2021
The differences in this example are:
Start date added in row of Index 2
Salary format corrected and email corrected for row Index 3
Department format corrected for row Index 4
What would be the best way to do this comparison?
I am not sure if there is an easy solution to understand what changed in each field but returning the dataframe with rows that had at least 1 change would be helpful.
Thank you for the support!
I think compare could do the trick: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.compare.html
But first you would need to align the rows between old and new dataframe via the index:
new_df_to_compare = new_df.loc[old_df.index]
If the datatypes don't match, you would also need to align them:
new_df_to_compare = new_df_to_compare.astype(old_df.dtypes.to_dict())
Then compare should work just like this:
difference_df = old_df.compare(new_df_to_compare)
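Putting it together for the example above (a sketch, assuming df1 holds last month's data, df2 holds the new data, and compare is available, i.e. pandas >= 1.1):
old_df = df1.set_index('Index')
new_df = df2.set_index('Index')
# keep only the rows whose Index appears in both months
common = old_df.index.intersection(new_df.index)
old_aligned = old_df.loc[common]
new_aligned = new_df.loc[common].astype(old_aligned.dtypes.to_dict())
# one row per changed Index, with 'self'/'other' sub-columns for each changed field
difference_df = old_aligned.compare(new_aligned)
changed_columns = difference_df.columns.get_level_values(0).unique()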
nocity.head()
user_id business_id stars
0 cjpdDjZyprfyDG3RlkVG3w uYHaNptLzDLoV_JZ_MuzUA 5
1 bjTcT8Ty4cJZhEOEo01FGA uYHaNptLzDLoV_JZ_MuzUA 3
2 AXgRULmWcME7J6Ix3I--ww uYHaNptLzDLoV_JZ_MuzUA 3
3 oU2SSOmsp_A8JYI7Z2JJ5w uYHaNptLzDLoV_JZ_MuzUA 4
4 0xtbPEna2Kei11vsU-U2Mw uYHaNptLzDLoV_JZ_MuzUA 5
withcity.head()
business_id city
0 YDf95gJZaq05wvo7hTQbbQ Richmond Heights
1 mLwM-h2YhXl2NCgdS84_Bw Charlotte
2 v2WhjAB3PIBA8J8VxG3wEg Toronto
3 CVtCbSB1zUcUWg-9TNGTuQ Scottsdale
4 duHFBe87uNSXImQmvBh87Q Phoenix
The nocity dataframe has business_id values (they may repeat, since it also has the rating each user_id gave for each business_id).
The withcity dataframe has the city associated with each business_id
The result I want is:
This is going to be super hard to word:
I want to look up the city associated with each business_id from the withcity dataframe and create a new column in nocity called cityname, which now has the city name associated with that business_id
Why I gave up trying and came here
I know this can be performed with some sort of join operation, but I don't understand which one exactly. I looked them up online and got a little confused as to what would happen if some business_id wasn't available in both dataframes when performing that join operation.
For example:
withcity has some business_id with some city value, but when performing whichever join is appropriate with nocity, that particular business_id is not found there.
So I came here for help.
What other alternative did I try?
area_dict = dict(zip(withcity.business_id, withcity.city))
emptylist = []
for rows in nocity['business_id']:
    for key, value in area_dict.items():
        if key == rows:
            emptylist.append(value)
I created a dictionary which held the business_id and the city from the withcity dataframe, and performed some sort of matching comparison with the nocity dataframe.
But my method will probably take a lot of time, since there are 4.7 million records to be exact.
IIUC, a left merge does it:
nocity.merge(withcity, on='business_id', how='left')
Out[855]:
user_id business_id stars city
0 cjpdDjZyprfyDG3RlkVG3w uYHaNptLzDLoV_JZ_MuzUA 5 NaN
1 bjTcT8Ty4cJZhEOEo01FGA uYHaNptLzDLoV_JZ_MuzUA 3 NaN
2 AXgRULmWcME7J6Ix3I--ww uYHaNptLzDLoV_JZ_MuzUA 3 NaN
3 oU2SSOmsp_A8JYI7Z2JJ5w uYHaNptLzDLoV_JZ_MuzUA 4 NaN
4 0xtbPEna2Kei11vsU-U2Mw uYHaNptLzDLoV_JZ_MuzUA 5 NaN
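As a side note, if you prefer the dictionary you already built, a vectorized map should produce the same column without the nested loops (a sketch reusing area_dict from the question; ids with no match become NaN):
nocity['cityname'] = nocity['business_id'].map(area_dict)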
In general, whenever you have a situation like this you want to consider avoiding loops and iterations and instead perform a merge. Then afterwards, you massage the data to fit your needs. For example, Wen's solution is the most apt way to do this.
However, there are a few things I would add. Say the two dfs are the ones shown in the question; let's call the first and second nocity and withcity, respectively.
You want to do:
nocity.merge(withcity, on='business_id', how='left')
However, if you end up getting NaN values as Wen got above, check the datatypes of your keys.
Meaning, if your business_id field in nocity is an int (for some reason) while the business_id field in withcity is a str, then pandas will have issues merging the dataframes and you will get NaN values instead of the desired city name.
To check, you would do:
#for all datatypes in the nocity df
print(nocity.dtypes)
#or just for the field's dtypes
print(nocity.business_id.dtypes)
Then you would convert to a common datatype like str if they were different...
#example conversion of pandas column (series) to different datatype
nocity.business_id = nocity.business_id.astype(str)
withcity.business_id = withcity.business_id.astype(str)
#then perform merge as usual
nocity = nocity.merge(withcity, on='business_id', how='left')
Hope this helps. Also, don't forget to rename the column from 'city' to 'cityname' if that is what you prefer:
nocity = nocity.rename(columns={'city': 'cityname'})
I have a huge dataset with about 60,000 rows. I would first use some criteria to do a groupby on the whole dataset, then split it into many small datasets according to those criteria and automatically run a function on each small dataset to get a parameter for each one. I have no idea how to do this. Is there any code that makes it possible?
This is what I have
Date name number
20100101 John 1
20100102 Kate 3
20100102 Kate 2
20100103 John 3
20100104 John 1
And I want it to be split into two small ones
Date name number
20100101 John 1
20100103 John 3
20100104 John 1
Date name number
20100102 Kate 3
20100102 Kate 2
I think a more efficient way than filtering the original dataset by subsetting is to use groupby(). As a demo:
for _, g in df.groupby('name'):
    print(g)
# Date name number
#0 20100101 John 1
#3 20100103 John 3
#4 20100104 John 1
# Date name number
#1 20100102 Kate 3
#2 20100102 Kate 2
So to get a list of small data frames, you can do [g for _, g in df.groupby('name')].
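If you only need the piece for one particular key, get_group should also work:
df.groupby('name').get_group('John')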
To expand on this answer, we can see more clearly what df.groupby() returns as follows:
for k, g in df.groupby('name'):
    print(k)
    print(g)
# John
# Date name number
# 0 20100101 John 1
# 3 20100103 John 3
# 4 20100104 John 1
# Kate
# Date name number
# 1 20100102 Kate 3
# 2 20100102 Kate 2
Each element returned by groupby() contains a key and a dataframe whose name column holds only that key's value. In the solution above, we don't need the key, so we just use a placeholder (_) and discard it.
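Since the original goal is to run a function on each small dataset and collect a parameter from each, a dict comprehension over the same groupby should do it (my_function here is just a placeholder for whatever you want to compute):
params = {name: my_function(g) for name, g in df.groupby('name')}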
Unless your function is really slow, this can probably be accomplished by slicing (e.g. df_small = df[a:b] for some indices a and b). The only trick is to choose a and b. I use range in the code below but you could do it other ways:
param_list = []
n = 10000  # size of each smaller dataframe
# loop over all 60000 rows, n at a time
for i in range(0, 60000, n):
    # take a slice of the big dataframe and apply function to get 'param'
    df_small = df[i:i+n]
    param = function(df_small)
    # keep our results in a list
    param_list.append(param)
EDIT: Based on the update, you could do something like this:
# loop through the unique names
for name in df['name'].unique():
    # take the rows for that name and apply function to get 'param'
    df_small = df[df['name'] == name]
So my google-fu doesn't seem to be doing me justice with what seems like it should be a trivial procedure.
In Pandas for Python I have 2 datasets that I want to merge. This works fine using .concat. The issue is that .concat reorders my columns. From a data retrieval point of view, this is trivial. From a "I just want to open the file and quickly see the most important column" point of view, this is annoying.
File1.csv
Name Username Alias1
Tom Tomfoolery TJZ
Meryl MsMeryl Mer
Timmy Midsize Yoda
File2.csv
Name Username Alias1 Alias2
Bob Firedbob Fire Gingy
Tom Tomfoolery TJZ Awww
Result.csv
Alias1 Alias2 Name Username
0 TJZ NaN Tom Tomfoolery
1 Mer NaN Meryl MsMeryl
2 Yoda NaN Timmy Midsize
0 Fire Gingy Bob Firedbob
1 TJZ Awww Tom Tomfoolery
The result is fine, but in the data file I'm working with I have 1,000 columns, and the 2-3 most important are now in the middle. Is there a way, in this toy example, that I could have forced "Username" to be the first column and "Name" to be the second column, obviously preserving the values below each all the way down?
Also, as a side note, when I save to file it also saves that numbering on the side (0 1 2 0 1). If there's a way to prevent that too, that'd be cool. If not, it's not a big deal since it's a quick fix to remove.
Thanks!
Assuming the concatenated DataFrame is df, you can perform the reordering of columns as follows:
important = ['Username', 'Name']
reordered = important + [c for c in df.columns if c not in important]
df = df[reordered]
print(df)
Output:
Username Name Alias1 Alias2
0 Tomfoolery Tom TJZ NaN
1 MsMeryl Meryl Mer NaN
2 Midsize Timmy Yoda NaN
0 Firedbob Bob Fire Gingy
1 Tomfoolery Tom TJZ Awww
The list of numbers [0, 1, 2, 0, 1] is the index of the DataFrame. To prevent them from being written to the output file, you can use the index=False option in to_csv():
df.to_csv('Result.csv', index=False, sep=' ')
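As a side note, if the ordering only matters for the file you open, to_csv can also take the column order directly via its columns argument (a sketch using the reordered list from above):
df.to_csv('Result.csv', index=False, sep=' ', columns=reordered)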