Merging 2 csv files - python

I want to merge two CSV files. The resulting data frame should have all of the columns from CSV 1. For example:
import pandas as pd

df1 = pd.DataFrame({'name': ['foo', 'bar', 'baz', 'foo'], 'value': [1, 2, 3, 5]})
df2 = pd.DataFrame({'class': ['a', 'b', 'c', 'd'], 'value': [5, 6, 7, 8]})
df3 = pd.merge(df1, df2, how='outer')
Result df3:
  name  value
0  foo      1
1  bar      2
2  baz      3
3  foo      5
4  NaN      6
5  NaN      7
6  NaN      8
How can I get the above result using joins?

This should get you sorted:
import pandas as pd

df1 = pd.DataFrame({'name': ['foo', 'bar', 'baz', 'foo'], 'value': [1, 2, 3, 5]})
df2 = pd.DataFrame({'class': ['a', 'b', 'c', 'd'], 'value': [5, 6, 7, 8]})
df3 = pd.merge(df1, df2, how='outer')
df3 = df3.drop([item for item in df2.columns if item not in df1.columns], axis=1)
print(df3)
Which gives:
  name  value
0  foo      1
1  bar      2
2  baz      3
3  foo      5
4  NaN      6
5  NaN      7
6  NaN      8

Alternatively, pd.concat with axis=1 places the two frames side by side; note, though, that this keeps the columns of both frames rather than only those of df1:
import pandas as pd

df1 = pd.DataFrame({'name': ['foo', 'bar', 'baz', 'foo'], 'value': [1, 2, 3, 5]})
df2 = pd.DataFrame({'class': ['a', 'b', 'c', 'd'], 'value': [5, 6, 7, 8]})
result = pd.concat([df1, df2], axis=1)
print(result)
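
Since the question explicitly asks for joins, here is a minimal sketch using DataFrame.join instead of pd.merge: move 'value' into the index on both frames, outer-join, then keep only df1's columns (the final column selection is an addition of this sketch, mirroring the merge-and-drop approach above):

import pandas as pd

df1 = pd.DataFrame({'name': ['foo', 'bar', 'baz', 'foo'], 'value': [1, 2, 3, 5]})
df2 = pd.DataFrame({'class': ['a', 'b', 'c', 'd'], 'value': [5, 6, 7, 8]})

# join() aligns on the index, so use 'value' as the index on both sides,
# then restore it as a column and keep only the columns present in df1.
df3 = df1.set_index('value').join(df2.set_index('value'), how='outer')
df3 = df3.reset_index()[df1.columns.tolist()]
print(df3)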

Create DataFrame and set_index at once

This works:
import pandas as pd
data = [["aa", 1, 2], ["bb", 3, 4]]
df = pd.DataFrame(data, columns=['id', 'a', 'b'])
df = df.set_index('id')
print(df)
"""
    a  b
id
aa  1  2
bb  3  4
"""
but is it possible in just one call of pd.DataFrame(...), directly with a parameter, without calling set_index afterwards?
Convert the values to a 2d array first (note that np.array upcasts everything to strings here, because the first column contains strings):
import numpy as np
import pandas as pd

data = [["aa", 1, 2], ["bb", 3, 4]]
arr = np.array(data)
df = pd.DataFrame(arr[:, 1:], columns=['a', 'b'], index=arr[:, 0])
print(df)
    a  b
aa  1  2
bb  3  4
Details:
print(arr)
[['aa' '1' '2']
 ['bb' '3' '4']]
Another solution:
data = [["aa", 1, 2], ["bb", 3, 4], ["cc", 30, 40]]
cols = ['a', 'b']
L = list(zip(*data))
print(L)
[('aa', 'bb', 'cc'), (1, 3, 30), (2, 4, 40)]
df = pd.DataFrame(dict(zip(cols, L[1:])), index=L[0])
print(df)
     a   b
aa   1   2
bb   3   4
cc  30  40
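
As for doing it in one call: DataFrame.from_records accepts an index parameter that names one of the supplied columns, which gets close to what the question asks (a sketch; note this uses the from_records classmethod rather than the plain pd.DataFrame constructor):

import pandas as pd

data = [["aa", 1, 2], ["bb", 3, 4]]
# index='id' tells from_records to consume that field as the index.
df = pd.DataFrame.from_records(data, columns=['id', 'a', 'b'], index='id')
print(df)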

Specify columns to output with Pandas Merge function

import pandas as pd

df1 = pd.DataFrame({'lkey': ['foo', 'bar', 'baz', 'foo'],
                    'value': [1, 2, 3, 5]})
df2 = pd.DataFrame({'rkey': ['dog', 'bar', 'baz', 'foo'],
                    'value': [5, 6, 7, 8],
                    'valuea': [9, 10, 11, 12],
                    'valueb': [13, 14, 15, 16]})
I would like to merge these two dataframes based on 'value'. However, I don't want the result to include every column from df2: I would like to keep the 'valuea' column but not the 'valueb' column.
The code I have tried is
df1.merge(df2, on='value')
Is there a way to exclude the column with header 'valueb' using parameters of the merge function?
You cannot exclude columns with a parameter in the merge function.
Instead, either drop the unwanted column after merging:
pd.merge(df1, df2).drop(columns=['valueb'])
or drop it from df2 before merging:
pd.merge(df1, df2.drop(columns=['valueb']))
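
With the example frames above, both variants should print the same frame (a sketch of the expected output: the default merge is an inner join on the shared 'value' column, and only value 5 appears in both frames):

  lkey  value rkey  valuea
0  foo      5  dog       9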

Filling column of dataframe based on 'groups' of values of another column

I am trying to fill values of a column based on the value of another column. Suppose I have the following dataframe:
import numpy as np
import pandas as pd

data = {'A': [4, 4, 5, 6],
        'B': ['a', np.nan, np.nan, 'd']}
df = pd.DataFrame(data)
I would like to fill column B, but only within groups of equal values in column A: all rows that share a value in column A (here, the rows where A equals 4) should end up with the same value in column B.
Thus, the desired output should be:
data = {'A': [4, 4, 5, 6],
        'B': ['a', 'a', np.nan, 'd']}
df = pd.DataFrame(data)
I am aware of the fillna method, but it gives the wrong output, as the third row also gets the value 'a' assigned:
df['B'].fillna(method="ffill", inplace=True)
data = {'A': [4, 4, 5, 6],
        'B': ['a', 'a', 'a', 'd']}
df = pd.DataFrame(data)
How can I get the desired output?
Try this:
df['B'] = df.groupby('A')['B'].ffill()
Output:
>>> df
   A    B
0  4    a
1  4    a
2  5  NaN
3  6    d
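Grouping by 'A' restricts the forward fill to rows that share the same value in A, so the A=5 row keeps its NaN instead of inheriting 'a' from the row above, as a plain ffill over the whole column would do.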

How to convert Pandas data frame to dict with values in a list

I have a huge Pandas data frame whose structure follows the example below:
import pandas as pd
df = pd.DataFrame({'col1': ['A', 'A', 'B', 'C', 'C', 'C'], 'col2': [1, 2, 5, 2, 4, 6]})
df
  col1  col2
0    A     1
1    A     2
2    B     5
3    C     2
4    C     4
5    C     6
The task is to build a dictionary with elements in col1 as keys and corresponding elements in col2 as values. For the example above the output should be:
A -> [1, 2]
B -> [5]
C -> [2, 4, 6]
I wrote a solution as
from collections import defaultdict

dd = defaultdict(list)
for row in df.itertuples():
    dd[row.col1].append(row.col2)
but I wonder if somebody is aware of a more "Python-native" solution, using built-in pandas functions.
Without apply, you can do it with a dict comprehension over a groupby:
{x : y.tolist() for x , y in df.col2.groupby(df.col1)}
{'A': [1, 2], 'B': [5], 'C': [2, 4, 6]}
Use GroupBy.apply with list for Series of lists and then Series.to_dict:
d = df.groupby('col1')['col2'].apply(list).to_dict()
print (d)
{'A': [1, 2], 'B': [5], 'C': [2, 4, 6]}
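
An equivalent spelling uses agg rather than apply, which is arguably more idiomatic for aggregating to lists (a sketch, using the same df as above):

d = df.groupby('col1')['col2'].agg(list).to_dict()
print(d)
{'A': [1, 2], 'B': [5], 'C': [2, 4, 6]}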

Unique values from multiple columns in pandas

distinct_values = df.col_name.unique().compute()
But what if I don't know the names of the columns?
I think you need:
import pandas as pd

df = pd.DataFrame({"colA": ['a', 'b', 'b', 'd', 'e'], "colB": [1, 2, 1, 2, 1]})
unique_dict = {}
# df.columns gives the list of columns in the dataframe
for col in df.columns:
    unique_dict[col] = list(df[col].unique())
Output:
{'colA': ['a', 'b', 'd', 'e'], 'colB': [1, 2]}
You can try this:
>>> import pandas as pd
>>> df = pd.DataFrame({'a': [1, 2, 3], 'b': [2, 3, 5]})
>>> d = dict()
>>> d['any_column_name'] = pd.unique(df.values.ravel('K'))
>>> d
{'any_column_name': array([1, 2, 3, 5])}
or for just one feature,
>>> d = dict()
>>> d['a'] = df['a'].unique()
>>> d
{'a': array([1, 2, 3])}
or individually for all,
>>> d = dict()
>>> for col in df.columns:
...     d[col] = df[col].unique()
...
>>> d
{'a': array([1, 2, 3]), 'b': array([2, 3, 5])}
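
The .compute() in the question suggests the original df is a Dask dataframe rather than a plain pandas one; if so, the same per-column idea should carry over, with a compute per column (a sketch under that assumption; ddf is a hypothetical Dask dataframe):

# Collect unique values per column from a Dask dataframe 'ddf'.
d = {col: ddf[col].unique().compute().tolist() for col in ddf.columns}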
