I have two dataframes.
feelingsDF with columns 'feeling', 'count', 'code'.
countryDF with columns 'feeling', 'countryCount'.
How do I make another dataframe that takes the columns from countryDF and combines them with the code column in feelingsDF?
I'm guessing you would need to somehow use the shared 'feeling' column in feelingsDF to combine them and make sure the same code matches the same feeling.
I want the three columns to appear as:
[feeling][countryCount][code]
You are joining the two dataframes by the column 'feeling'. Assuming you only want the entries in 'feeling' that are common to both dataframes, you would want to do an inner join.
Here is a similar example with two dfs:
import pandas as pd

x = pd.DataFrame({'feeling': ['happy', 'sad', 'angry', 'upset', 'wow'], 'col1': [1, 2, 3, 4, 5]})
y = pd.DataFrame({'feeling': ['okay', 'happy', 'sad', 'not', 'wow'], 'col2': [20, 23, 44, 10, 15]})
x.merge(y, how='inner', on='feeling')
Output:
feeling col1 col2
0 happy 1 23
1 sad 2 44
2 wow 5 15
To drop the 'count' column, select the other columns of feelingsDF, and then sort by the 'countryCount' column. Note that this will leave your index out of order, but you can reindex the combined_df afterwards.
combined_df = feelingsDF[['feeling', 'code']].merge(countryDF, how='inner', on='feeling').sort_values('countryCount')
# To reset the index after sorting:
combined_df = combined_df.reset_index(drop=True)
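If you also want the columns in the exact [feeling][countryCount][code] order asked for, you can reorder them after the merge:
combined_df = combined_df[['feeling', 'countryCount', 'code']]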
You can join two dataframes using pd.merge. Assuming that you want to join on the feeling column, you can use:
df = pd.merge(feelingsDF, countryDF, on='feeling', how='left')
See documentation for pd.merge to understand how to use the on and how parameters.
feelingsDF = pd.DataFrame([{'feeling': 1, 'count': 10, 'code': 'X'},
                           {'feeling': 2, 'count': 5, 'code': 'Y'},
                           {'feeling': 3, 'count': 1, 'code': 'Z'}])
feeling count code
0 1 10 X
1 2 5 Y
2 3 1 Z
countryDF = pd.DataFrame([{'feeling':1,'country':'US'},{'feeling':2,'country':'UK'},{'feeling':3,'country':'DE'}])
feeling country
0 1 US
1 2 UK
2 3 DE
df = pd.merge(feelingsDF, countryDF, on='feeling', how='left')
feeling count code country
0 1 10 X US
1 2 5 Y UK
2 3 1 Z DE
I have two dataframes:
The first one looks like this:
                variable
entry subentry
0     1                X
      2                Y
      3                Z
and the second one looks like:
                variable
entry subentry
0     1                A
      2                B
I would like to merge the two dataframes such that I get:
                variable
entry subentry
0     1                X
      2                Y
      3                Z
1     1                A
      2                B
Simply using df1.append(df2, ignore_index=True) gives
  variable
0        X
1        Y
2        Z
3        A
4        B
In other words, it collapses the multiindex into a single index. Is there a way around this?
Edit: Here is a code snippet that will reproduce the problem:
import numpy as np
import pandas as pd

arrays = [
    np.array([0, 0, 0]),
    np.array([0, 1, 2]),
]
arrays_2 = [
    np.array([0, 0]),
    np.array([0, 1]),
]
df1 = pd.DataFrame(np.random.randn(3, 1), index=arrays)
df2 = pd.DataFrame(np.random.randn(2, 1), index=arrays_2)
df = df1.append(df2, ignore_index=True)  # collapses the multiindex
print(df)
Edit: In practice, I am looking to combine N dataframes, each with a different number of "entry" rows. So I am looking for an approach that does not rely on me knowing the exact number of dataframes I am combining.
One way to try:
pd.concat([df1, df2], keys=[0,1]).droplevel(1)
Output:
0
0 0 -0.439749
1 -0.478744
2 0.719870
1 0 -1.055648
1 -2.007242
Use pd.concat to concatenate the dataframes, and since the 'entry' level is the same in both, use the keys parameter to create a new level with the naming you want. Finally, drop the old index level (where the value was the same).
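Since the edit mentions combining N dataframes, the same idea generalizes without knowing N in advance. A minimal sketch, assuming the frames are gathered in a list (the name dfs is hypothetical):
# dfs is a hypothetical list holding your N dataframes
dfs = [df1, df2]
combined = pd.concat(dfs, keys=range(len(dfs))).droplevel(1)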
New to coding here and trying to make a project. I want to compare two DFs, and if any of the rows in the product column match, I want to copy them over to a new DF. The rows in DF1 and DF2 will not be in the same position, i.e. I want to compare row 1 of DF1 against the entire column in DF2. Is there an easy solution to this?
Take a look at this: https://cmdlinetips.com/2018/02/how-to-subset-pandas-dataframe-based-on-values-of-a-column/
You can try:
df3 = df1[df1['Product'].isin(set(df2['Product']))]
Which gives:
>>> df1 = pd.DataFrame({'prod':[1,2], 'ean':[5,6]})
>>> df1
prod ean
0 1 5
1 2 6
>>> df2 = pd.DataFrame({'prod':[3,2]})
>>> df2
prod
0 3
1 2
>>> df1[df1['prod'].isin(set(df2['prod']))]
prod ean
1 2 6
To explain:
df1[...] filters the rows of df1 based on the criterion ...
I'm using a set() here so it is fast to check whether a row of df1 appears in df2's "Product" column
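If you prefer a merge-based variant, this sketch should select the same rows (the drop_duplicates guards against repeated products in df2 duplicating rows of df1):
df3 = df1.merge(df2[['prod']].drop_duplicates(), on='prod')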
I have multiple DataFrames, each containing a row called 'location' and another row called 'value' (both make up the index). For example, suppose I have the following 2:
df1 = pd.DataFrame(np.array([[-4,2,5],['nyc','sf','chi']]), columns=['col1','col2','col3'], index=['value','location'])
df2 = pd.DataFrame(np.array([[5,0,-3],['nyc','sf','chi']]), columns=['col1','col2','col3'], index=['value','location'])
the DataFrames will be housed in a dictionary that I can iterate through. Ultimately, I want to retrieve the list of 'value's for each 'location' in a separate DataFrame, so the desired output would look like:
location  nyc  sf  chi
0          -4   2    5
1           5   0   -3
This is a toy example; my real one will have many more DataFrames, and the source DataFrames will have other rows besides the 2 key ones I am interested in.
I would recommend set_index and concat:
(pd.concat([df.T.set_index('location')['value'] for df in [df1, df2]], axis=1)
.T
.reset_index(drop=True))
location nyc sf chi
0 -4 2 5
1 5 0 -3
Using merge
df1.T.merge(df2.T,on='location').set_index('location').T
location nyc sf chi
value_x -4 2 5
value_y 5 0 -3
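If you want the clean 0..N-1 index of the first answer rather than value_x/value_y, a small follow-up sketch:
df1.T.merge(df2.T, on='location').set_index('location').T.reset_index(drop=True)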
Similar to this topic: Add default values while merging tables in pandas
The answer to this topic fills all NaN in the resulting DataFrame and that's not what I want to do.
Let's imagine the following situation: I have two dataframes, df1 and df2. Each of these DataFrames might contain some NaN. The columns of df1 are 'a' and col1, and the columns of df2 are 'a' and col2, where col1 and col2 are disjoint lists of column names (for example, df1 and df2 could have 'a', 'b', 'c' and 'a', 'd', 'e' respectively as column names). I want to perform a left merge on df1 and df2 and fill all the missing values of that merge (any row of df1 whose value in column 'a' is not a value of column 'a' in df2) with a default value. We can imagine that I have a dict default_values that maps any element of col2 to a default value.
To give you a concrete example :
df1
a b c
0 0 0.038108 0.961687
1 1 0.107457 0.616689
2 2 0.661485 0.240353
3 3 0.457169 0.560912
4 5 5.000000 5.000000
df2
a d e
0 0 0.405170 0.934776
1 1 0.684532 0.168738
2 2 0.729693 0.967310
3 3 0.844770 NaN
4 4 0.842673 0.941324
default_values = {'d':42, 'e':43}
Expected Output :
a b c d e
0 0 0.038108 0.961687 0.405170 0.934776
1 1 0.107457 0.616689 0.684532 0.168738
2 2 0.661485 0.240353 0.729693 0.967310
3 3 0.457169 0.560912 0.844770 NaN
4 5 5.000000 5.000000 42 43
While writing this question, I found a working solution. I still think it's an interesting question. Here's a solution to get the expected output :
# Build a frame of defaults for the rows of df1 whose 'a' value is absent from df2
df3 = pd.DataFrame(default_values,
                   index=df1.set_index('a').index.difference(df2.a))
df3['a'] = df3.index
# Append the defaults to df2 before merging, so those rows are never missing
df1.merge(pd.concat((df2, df3), sort=False))
This solution works for a left/right merge, and it can be extended to work for an outer merge (by completing the first dataframe as well).
Edit : The how='left' argument is not specified in my merge because the DataFrame I'm merging with is constructed to have all the value of the column 'a' in df1 in its own column 'a'. We could add an how='left' to this call of merge, and it would give the same output.
A non-indexed df contains rows of gene, a cell that contains a mutation in that gene, and the type of mutation in that gene:
df = pd.DataFrame({'gene': ['one','one','one','two','two','two','three'],
'cell': ['A', 'A', 'C', 'A', 'B', 'C','A'],
'mutation': ['frameshift', 'missense', 'nonsense', '3UTR', '3UTR', '3UTR', '3UTR']})
df:
cell gene mutation
0 A one frameshift
1 A one missense
2 C one nonsense
3 A two 3UTR
4 B two 3UTR
5 C two 3UTR
6 A three 3UTR
I'd like to pivot this df so I can index by gene and set columns to cells. The trouble is that there can be multiple entries per cell: there can be multiple mutations in any one gene in a given cell (cell A has two different mutations in gene One). So when I run:
df.pivot_table(index='gene', columns='cell', values='mutation')
this happens:
DataError: No numeric types to aggregate
I'd like to use masking to perform the pivot while capturing the presence of at least one mutation:
       A  B  C
gene
one    1  0  1
two    1  1  1
three  1  0  0
Solution with drop_duplicates and pivot_table:
df = (df.drop_duplicates(['cell','gene'])
        .pivot_table(index='gene',
                     columns='cell',
                     values='mutation',
                     aggfunc=len,
                     fill_value=0))
print (df)
cell A B C
gene
one 1 0 1
three 1 0 0
two 1 1 1
Another solution with drop_duplicates, groupby with aggregate size and last reshape by unstack:
df = (df.drop_duplicates(['cell','gene'])
        .groupby(['cell', 'gene'])
        .size()
        .unstack(0, fill_value=0))
print (df)
cell A B C
gene
one 1 0 1
three 1 0 0
two 1 1 1
That error message is misleading: the DataError is raised because the default aggregation function (mean) only works on numeric columns, not because there are multiple values per cell. pivot_table can handle multiple values for the same index/column pair; I don't believe this is true for the pivot method. You can fix your problem by changing the aggregation to something that works on strings as opposed to numerics.
df.pivot_table(index='gene',
columns='cell',
values='mutation',
aggfunc='count', fill_value=0)
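Note that without drop_duplicates this counts every mutation, so the (one, A) cell would be 2 rather than 1. If you only want a 0/1 presence flag, one way is to threshold the counts, a sketch:
presence = (df.pivot_table(index='gene', columns='cell', values='mutation',
                           aggfunc='count', fill_value=0) > 0).astype(int)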
If you only want 1 value per cell, you can do a groupby, aggregate everything to 1, and then unstack the 'cell' level.
df.groupby(['cell', 'gene'])['mutation'].agg(lambda x: 1).unstack(0, fill_value=0)
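For the sample df above, this gives the same presence matrix as the drop_duplicates solutions:
cell   A  B  C
gene
one    1  0  1
three  1  0  0
two    1  1  1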