Merging 2 csv files - python

I want to merge two CSV files. The resulting data frame should have all of the columns from CSV 1. For example:
import pandas as pd

df1 = pd.DataFrame({'name': ['foo', 'bar', 'baz', 'foo'], 'value': [1, 2, 3, 5]})
df2 = pd.DataFrame({'class': ['a', 'b', 'c', 'd'], 'value': [5, 6, 7, 8]})
df3 = pd.merge(df1, df2, how='outer')
Result df3:
  name  value
0  foo      1
1  bar      2
2  baz      3
3  foo      5
4  NaN      6
5  NaN      7
6  NaN      8
How can I get the above result using joins?

This should get you sorted:
import pandas as pd

df1 = pd.DataFrame({'name': ['foo', 'bar', 'baz', 'foo'], 'value': [1, 2, 3, 5]})
df2 = pd.DataFrame({'class': ['a', 'b', 'c', 'd'], 'value': [5, 6, 7, 8]})
df3 = pd.merge(df1, df2, how='outer')
df3 = df3.drop([item for item in df2.columns if item not in df1.columns], axis=1)
print(df3)
Which gives:
  name  value
0  foo      1
1  bar      2
2  baz      3
3  foo      5
4  NaN      6
5  NaN      7
6  NaN      8

Alternatively, pd.concat with axis=1 places the two frames side by side; note, though, that this keeps the columns of both frames rather than only those of df1:
import pandas as pd

df1 = pd.DataFrame({'name': ['foo', 'bar', 'baz', 'foo'], 'value': [1, 2, 3, 5]})
df2 = pd.DataFrame({'class': ['a', 'b', 'c', 'd'], 'value': [5, 6, 7, 8]})
result = pd.concat([df1, df2], axis=1)
print(result)
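
Since the question explicitly asks for joins, here is a minimal sketch using DataFrame.join instead of pd.merge: move 'value' into the index on both frames, outer-join, then keep only df1's columns (the final column selection is an addition of this sketch, mirroring the merge-and-drop approach above):

import pandas as pd

df1 = pd.DataFrame({'name': ['foo', 'bar', 'baz', 'foo'], 'value': [1, 2, 3, 5]})
df2 = pd.DataFrame({'class': ['a', 'b', 'c', 'd'], 'value': [5, 6, 7, 8]})

# join() aligns on the index, so use 'value' as the index on both sides,
# then restore it as a column and keep only the columns present in df1.
df3 = df1.set_index('value').join(df2.set_index('value'), how='outer')
df3 = df3.reset_index()[df1.columns.tolist()]
print(df3)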

Create DataFrame and set_index at once

This works:
import pandas as pd
data = [["aa", 1, 2], ["bb", 3, 4]]
df = pd.DataFrame(data, columns=['id', 'a', 'b'])
df = df.set_index('id')
print(df)
"""
    a  b
id
aa  1  2
bb  3  4
"""
but is it possible in just one call of pd.DataFrame(...), directly with a parameter, without calling set_index afterwards?
Convert the values to a 2d array first (note that np.array upcasts everything to strings here, because the first column contains strings):
import numpy as np
import pandas as pd

data = [["aa", 1, 2], ["bb", 3, 4]]
arr = np.array(data)
df = pd.DataFrame(arr[:, 1:], columns=['a', 'b'], index=arr[:, 0])
print(df)
    a  b
aa  1  2
bb  3  4
Details:
print(arr)
[['aa' '1' '2']
 ['bb' '3' '4']]
Another solution:
data = [["aa", 1, 2], ["bb", 3, 4], ["cc", 30, 40]]
cols = ['a', 'b']
L = list(zip(*data))
print(L)
[('aa', 'bb', 'cc'), (1, 3, 30), (2, 4, 40)]
df = pd.DataFrame(dict(zip(cols, L[1:])), index=L[0])
print(df)
     a   b
aa   1   2
bb   3   4
cc  30  40
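
As for doing it in one call: DataFrame.from_records accepts an index parameter that names one of the supplied columns, which gets close to what the question asks (a sketch; note this uses the from_records classmethod rather than the plain pd.DataFrame constructor):

import pandas as pd

data = [["aa", 1, 2], ["bb", 3, 4]]
# index='id' tells from_records to consume that field as the index.
df = pd.DataFrame.from_records(data, columns=['id', 'a', 'b'], index='id')
print(df)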

Specify columns to output with Pandas Merge function

import pandas as pd

df1 = pd.DataFrame({'lkey': ['foo', 'bar', 'baz', 'foo'],
                    'value': [1, 2, 3, 5]})
df2 = pd.DataFrame({'rkey': ['dog', 'bar', 'baz', 'foo'],
                    'value': [5, 6, 7, 8],
                    'valuea': [9, 10, 11, 12],
                    'valueb': [13, 14, 15, 16]})
I would like to merge these two dataframes based on 'value'. However, I don't want the result to include every column from df2: I would like to keep the 'valuea' column but not the 'valueb' column.
The code I have tried is
df1.merge(df2, on='value')
Is there a way to exclude the column with header 'valueb' using parameters of the merge function?
You cannot exclude columns with a parameter in the merge function.
Instead, either drop the unwanted column after merging:
pd.merge(df1, df2).drop(columns=['valueb'])
or drop it from df2 before merging:
pd.merge(df1, df2.drop(columns=['valueb']))
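
With the example frames above, both variants should print the same frame (a sketch of the expected output: the default merge is an inner join on the shared 'value' column, and only value 5 appears in both frames):

  lkey  value rkey  valuea
0  foo      5  dog       9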

Filling column of dataframe based on 'groups' of values of another column

I am trying to fill values of a column based on the value of another column. Suppose I have the following dataframe:
import numpy as np
import pandas as pd

data = {'A': [4, 4, 5, 6],
        'B': ['a', np.nan, np.nan, 'd']}
df = pd.DataFrame(data)
I would like to fill column B, but only within groups of equal values in column A: all rows that share a value in column A (here, the rows where A equals 4) should end up with the same value in column B.
Thus, the desired output should be:
data = {'A': [4, 4, 5, 6],
        'B': ['a', 'a', np.nan, 'd']}
df = pd.DataFrame(data)
I am aware of the fillna method, but it gives the wrong output, as the third row also gets the value 'a' assigned:
df['B'].fillna(method="ffill", inplace=True)
data = {'A': [4, 4, 5, 6],
        'B': ['a', 'a', 'a', 'd']}
df = pd.DataFrame(data)
How can I get the desired output?
Try this:
df['B'] = df.groupby('A')['B'].ffill()
Output:
>>> df
   A    B
0  4    a
1  4    a
2  5  NaN
3  6    d
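Grouping by 'A' restricts the forward fill to rows that share the same value in A, so the A=5 row keeps its NaN instead of inheriting 'a' from the row above, as a plain ffill over the whole column would do.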

How to convert Pandas data frame to dict with values in a list

I have a huge Pandas data frame whose structure follows the example below:
import pandas as pd
df = pd.DataFrame({'col1': ['A', 'A', 'B', 'C', 'C', 'C'], 'col2': [1, 2, 5, 2, 4, 6]})
df
  col1  col2
0    A     1
1    A     2
2    B     5
3    C     2
4    C     4
5    C     6
The task is to build a dictionary with elements in col1 as keys and corresponding elements in col2 as values. For the example above the output should be:
A -> [1, 2]
B -> [5]
C -> [2, 4, 6]
I wrote a solution as
from collections import defaultdict

dd = defaultdict(list)
for row in df.itertuples():
    dd[row.col1].append(row.col2)
but I wonder if somebody is aware of a more "Python-native" solution, using built-in pandas functions.
Without apply, you can do it with a dict comprehension over a groupby:
{x : y.tolist() for x , y in df.col2.groupby(df.col1)}
{'A': [1, 2], 'B': [5], 'C': [2, 4, 6]}
Use GroupBy.apply with list for Series of lists and then Series.to_dict:
d = df.groupby('col1')['col2'].apply(list).to_dict()
print (d)
{'A': [1, 2], 'B': [5], 'C': [2, 4, 6]}
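
An equivalent spelling uses agg rather than apply, which is arguably more idiomatic for aggregating to lists (a sketch, using the same df as above):

d = df.groupby('col1')['col2'].agg(list).to_dict()
print(d)
{'A': [1, 2], 'B': [5], 'C': [2, 4, 6]}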

Unique values from multiple columns in pandas

distinct_values = df.col_name.unique().compute()
But what if I don't know the names of the columns?
I think you need:
import pandas as pd

df = pd.DataFrame({"colA": ['a', 'b', 'b', 'd', 'e'], "colB": [1, 2, 1, 2, 1]})
unique_dict = {}
# df.columns gives the list of columns in the dataframe
for col in df.columns:
    unique_dict[col] = list(df[col].unique())
Output:
{'colA': ['a', 'b', 'd', 'e'], 'colB': [1, 2]}
You can try this:
>>> import pandas as pd
>>> df = pd.DataFrame({'a': [1, 2, 3], 'b': [2, 3, 5]})
>>> d = dict()
>>> d['any_column_name'] = pd.unique(df.values.ravel('K'))
>>> d
{'any_column_name': array([1, 2, 3, 5])}
or for just one feature,
>>> d = dict()
>>> d['a'] = df['a'].unique()
>>> d
{'a': array([1, 2, 3])}
or individually for all,
>>> d = dict()
>>> for col in df.columns:
...     d[col] = df[col].unique()
...
>>> d
{'a': array([1, 2, 3]), 'b': array([2, 3, 5])}
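
The .compute() in the question suggests the original df is a Dask dataframe rather than a plain pandas one; if so, the same per-column idea should carry over, with a compute per column (a sketch under that assumption; ddf is a hypothetical Dask dataframe):

# Collect unique values per column from a Dask dataframe 'ddf'.
d = {col: ddf[col].unique().compute().tolist() for col in ddf.columns}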
