How to convert two rows of data into a single row - python

I want to convert the data below into one row per ID using pandas.
Original data:
ID Name m1 m2 m3
1  X    2  6  6
1  Y    1  2  3
2  A    2  4  7
2  y    5  6  7
I want to convert it into the format below using the pandas library:
ID Name1 m1 m2 m3 Name2 m1 m2 m3
1  X     2  6  6  Y     1  2  3
2  A     2  4  7  y     5  6  7

Let's assume this is your data:
import pandas as pd

data = {'ID': [1, 1, 2, 2],
        'Name': ['X', 'Y', 'A', 'y'],
        'm1': [2, 1, 2, 5], 'm2': [6, 2, 4, 6],
        'm3': [6, 3, 7, 7]}
df = pd.DataFrame(data)
Step 1: Sort the data by ID:
df = df.sort_values(by=['ID'])
Step 2: Drop duplicates, keeping the first record per ID:
df1 = df.drop_duplicates(subset=['ID'], keep='first')
Step 3: Drop duplicates again, but keep the last record per ID:
df2 = df.drop_duplicates(subset=['ID'], keep='last')
Step 4: Finally, merge the two dataframes on ID:
df = df1.merge(df2, on='ID')
The output would look like:
   ID Name_x  m1_x  m2_x  m3_x Name_y  m1_y  m2_y  m3_y
0   1      X     2     6     6      Y     1     2     3
1   2      A     2     4     7      y     5     6     7
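Note that the drop_duplicates approach assumes exactly two rows per ID. If an ID can appear more than twice, a more general reshape (a sketch, not part of the original answer) numbers the rows within each ID with groupby().cumcount() and spreads them into columns with unstack:
import pandas as pd

data = {'ID': [1, 1, 2, 2],
        'Name': ['X', 'Y', 'A', 'y'],
        'm1': [2, 1, 2, 5], 'm2': [6, 2, 4, 6],
        'm3': [6, 3, 7, 7]}
df = pd.DataFrame(data)

# number the rows within each ID: 1, 2, ...
df['pos'] = df.groupby('ID').cumcount() + 1
wide = df.set_index(['ID', 'pos']).unstack()
# group the columns by occurrence number, then flatten the MultiIndex
wide = wide.sort_index(axis=1, level=1)
wide.columns = [f'{col}_{pos}' for col, pos in wide.columns]
print(wide.reset_index())
#    ID Name_1  m1_1  m2_1  m3_1 Name_2  m1_2  m2_2  m3_2
# 0   1      X     2     6     6      Y     1     2     3
# 1   2      A     2     4     7      y     5     6     7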


Compare dataframes and only use unmatched values

I have two dataframes that I want to compare, but only want to use the values that are not in both dataframes.
Example:
DF1:
A B C
0 1 2 3
1 4 5 6
DF2:
A B C
0 1 2 3
1 4 5 6
2 7 8 9
3 10 11 12
So, from this example I want to work with row index 2 and 3 ([7, 8, 9] and [10, 11, 12]).
The code I currently have (it only removes duplicates) is below.
df = pd.concat([di_old, di_new])
df = df.reset_index(drop=True)
df_gpby = df.groupby(list(df.columns))
idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1]
print(df.reindex(idx))
I would do:
df_n = df2[~df2.isin(df1).all(axis=1)]
Output:
    A   B   C
2   7   8   9
3  10  11  12
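A common alternative for "rows in one frame but not the other" is a left merge with indicator=True (a sketch using the example frames above):
import pandas as pd

df1 = pd.DataFrame({'A': [1, 4], 'B': [2, 5], 'C': [3, 6]})
df2 = pd.DataFrame({'A': [1, 4, 7, 10], 'B': [2, 5, 8, 11],
                    'C': [3, 6, 9, 12]})

# indicator=True adds a _merge column recording where each row came from
merged = df2.merge(df1, how='left', indicator=True)
unmatched = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')
print(unmatched)
#     A   B   C
# 2   7   8   9
# 3  10  11  12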

Pandas: Split columns into multiple columns by two delimiters

I have data like this
ID INFO
1 A=2;B=2;C=5
2 A=3;B=4;C=1
3 A=1;B=3;C=2
I want to split the Info columns into
ID A B C
1 2 2 5
2 3 4 1
3 1 3 2
I can split columns with one delimiter by using
df['A'], df['B'], df['C'] = df['INFO'].str.split(';').str
then split again by '=', but this seems inefficient when there are many rows, and especially when there are so many fields that they cannot be hard-coded beforehand.
Any suggestion would be greatly welcome.
You could use named groups together with Series.str.extract. At the end, concat the 'ID' back. This assumes you always have A=, B=, and C= in a line.
pd.concat([df['ID'],
           df['INFO'].str.extract(r'A=(?P<A>\d);B=(?P<B>\d);C=(?P<C>\d)')],
          axis=1)
#   ID  A  B  C
# 0  1  2  2  5
# 1  2  3  4  1
# 2  3  1  3  2
If you want a more flexible solution that can deal with cases where a single line might be 'A=1;C=2', then we can split on ';' and partition on '='. Pivot at the end to get the desired output.
### Starting Data
#ID INFO
#1 A=2;B=2;C=5
#2 A=3;B=4;C=1
#3 A=1;B=3;C=2
#4 A=1;C=2
(df.set_index('ID')['INFO']
   .str.split(';', expand=True)
   .stack()
   .str.partition('=')
   .reset_index(-1, drop=True)
   .pivot(columns=0, values=2)
)
# A B C
#ID
#1 2 2 5
#2 3 4 1
#3 1 3 2
#4 1 NaN 2
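An equally flexible variant (a sketch, not in the original answers) uses Series.str.extractall to pull arbitrary key=value pairs; it assumes keys contain no '=' or ';'. With the four-row starting data above:
kv = df['INFO'].str.extractall(r'(?P<key>[^=;]+)=(?P<val>[^;]*)')
# drop the per-row match counter, then pivot the keys out to columns
wide = (kv.reset_index(level='match', drop=True)
          .pivot(columns='key', values='val'))
out = df[['ID']].join(wide)
print(out)
#   ID  A    B  C
# 0  1  2    2  5
# 1  2  3    4  1
# 2  3  1    3  2
# 3  4  1  NaN  2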
Iterating over a Series is much faster than iterating across the rows of a dataframe.
So I would do:
pd.DataFrame([dict(x.split('=') for x in t.split(';')) for t in df['INFO']],
             index=df['ID']).reset_index()
It gives as expected:
ID A B C
0 1 2 2 5
1 2 3 4 1
2 3 1 3 2
It should be faster than splitting the dataframe columns twice.
values = [dict(item.split("=") for item in value.split(";")) for value in df.INFO]
df[['a', 'b', 'c']] = pd.DataFrame(values)
This will give you the desired output:
ID INFO a b c
1 a=1;b=2;c=3 1 2 3
2 a=4;b=5;c=6 4 5 6
3 a=7;b=8;c=9 7 8 9
Explanation:
The first line converts every value to a dictionary.
e.g.
x = 'a=1;b=2;c=3'
dict(item.split("=") for item in x.split(";"))
results in :
{'a': '1', 'b': '2', 'c': '3'}
DataFrame can take a list of dicts as an input and turn it into a dataframe.
Then you only need to assign the dataframe to the columns you want:
df[['a', 'b', 'c']] = pd.DataFrame(values)
Another solution is Series.str.findall to extract the values, followed by apply(pd.Series):
df[["A", "B", "C"]] = df.INFO.str.findall(r'=(\d+)').apply(pd.Series)
df = df.drop(columns="INFO")
Details:
df = pd.DataFrame([[1, "A=2;B=2;C=5"],
                   [2, "A=3;B=4;C=1"],
                   [3, "A=1;B=3;C=2"]],
                  columns=["ID", "INFO"])
print(df.INFO.str.findall(r'=(\d+)'))
# 0 [2, 2, 5]
# 1 [3, 4, 1]
# 2 [1, 3, 2]
df[["A", "B", "C"]] = df.INFO.str.findall(r'=(\d+)').apply(pd.Series)
print(df)
# ID INFO A B C
# 0 1 A=2;B=2;C=5 2 2 5
# 1 2 A=3;B=4;C=1 3 4 1
# 2 3 A=1;B=3;C=2 1 3 2
# Remove the INFO column
df = df.drop(columns="INFO")
print(df)
# ID A B C
# 0 1 2 2 5
# 1 2 3 4 1
# 2 3 1 3 2
Another solution:
# split on ';', explode, split again on '=', then pivot
df_INFO = (df.INFO
             .str.split(';')
             .explode()
             .str.split('=', expand=True)
             .pivot(columns=0, values=1)
)
pd.concat([df.ID, df_INFO], axis=1)
ID A B C
0 1 2 2 5
1 2 3 4 1
2 3 1 3 2

create a dataframe from 3 other dataframes in python

I am trying to create a new df which summarises my key information, by taking that information from 3 (say) other dataframes.
dfdate = {'x1': [2, 4, 7, 5, 6],
          'x2': [2, 2, 2, 6, 7],
          'y1': [3, 1, 4, 5, 9]}
dfdate = pd.DataFrame(dfdate, index=range(5))
dfqty = {'x1': [1, 2, 6, 6, 8],
         'x2': [3, 1, 1, 7, 5],
         'y1': [2, 4, 3, 2, 8]}
dfqty = pd.DataFrame(dfqty, index=range(5))
dfprices = {'x1': [0, 2, 2, 4, 4],
            'x2': [2, 0, 0, 3, 4],
            'y1': [1, 3, 2, 1, 3]}
dfprices = pd.DataFrame(dfprices, index=range(5))
Let us say the above 3 dataframes are my data. Say, some dates, qty, and prices of goods. My new df is to be constructed from the above data:
rng = len(dfprices.columns) * len(dfprices.index)  # this is the length of the new df
dfnew = pd.DataFrame(np.nan, index=range(0, rng),
                     columns=['Letter', 'Number', 'date', 'qty', 'price'])
Now, this is where I struggle to piece my stuff together. I am trying to take all the data in dfdate and put it into a column in the new df, and the same with dfqty and dfprices (so three 5x3 matrices essentially go to 15x1 vectors and are placed into the new df).
As well as that, I need a couple of columns in dfnew as identifiers, taken from the column names of the old dfs.
I've tried for loops but to no avail, and I don't know how to convert a df to a series. My desired output is:
dfnew:
'Lettercol','Numbercol', 'date', 'qty', 'price'
0 X 1 2 1 0
1 X 1 4 2 2
2 X 1 7 6 2
3 X 1 5 6 4
4 X 1 6 8 4
5 X 2 2 3 2
6 X 2 2 1 0
7 X 2 2 1 0
8 X 2 6 7 3
9 X 2 7 5 4
10 Y 1 3 2 1
11 Y 1 1 4 3
12 Y 1 4 3 2
13 Y 1 5 2 1
14 Y 1 9 8 3
where the numbers 0-14 are the index.
letter = letter from col header in DFs
number = number from col header in DFs
next 3 columns are data from the orig df's
(don't ask why the original data is in that funny format :)
Thanks so much. My last question wasn't well received, so I've tried to make this one better.
Use:
#list of DataFrames
dfs = [dfdate, dfqty, dfprices]
#list comprehension with reshape
comb = [x.unstack() for x in dfs]
#join together
df = pd.concat(comb, axis=1, keys=['date', 'qty', 'price'])
#remove second level of MultiIndex and index to column
df = df.reset_index(level=1, drop=True).reset_index().rename(columns={'index':'col'})
#extract everything after the first character with [1:] and the first letter with [0]
df['Number'] = df['col'].str[1:]
df['Letter'] = df['col'].str[0]
cols = ['Letter', 'Number', 'date', 'qty', 'price']
#change order of columns
df = df.reindex(columns=cols)
print (df)
Letter Number date qty price
0 x 1 2 1 0
1 x 1 4 2 2
2 x 1 7 6 2
3 x 1 5 6 4
4 x 1 6 8 4
5 x 2 2 3 2
6 x 2 2 1 0
7 x 2 2 1 0
8 x 2 6 7 3
9 x 2 7 5 4
10 y 1 3 2 1
11 y 1 1 4 3
12 y 1 4 3 2
13 y 1 5 2 1
14 y 1 9 8 3
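An equivalent reshape (a sketch, not part of the answer above) melts each frame to long form and concatenates the value columns; melt stacks the columns in order, so the rows line up across the three frames:
dfs = {'date': dfdate, 'qty': dfqty, 'price': dfprices}
# melt each frame: 'col' holds the original column name, the values get the frame's name
melted = {name: frame.melt(var_name='col', value_name=name)
          for name, frame in dfs.items()}
out = pd.concat([melted['date'],
                 melted['qty']['qty'],
                 melted['price']['price']], axis=1)
out['Letter'] = out['col'].str[0]
out['Number'] = out['col'].str[1:]
print(out[['Letter', 'Number', 'date', 'qty', 'price']])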

Sort all columns of a pandas DataFrame independently using sort_values()

I have a dataframe and want to sort all columns independently in descending or ascending order.
import pandas as pd
data = {'a': [5, 2, 3, 6],
'b': [7, 9, 1, 4],
'c': [1, 5, 4, 2]}
df = pd.DataFrame.from_dict(data)
a b c
0 5 7 1
1 2 9 5
2 3 1 4
3 6 4 2
When I use sort_values() for this it does not work as expected (to me) and only sorts one column:
foo = df.sort_values(by=['a', 'b', 'c'], ascending=[False, False, False])
a b c
3 6 4 2
0 5 7 1
2 3 1 4
1 2 9 5
I can get the desired result if I use the solution from this answer which applies a lambda function:
bar = df.apply(lambda x: x.sort_values().values)
print(bar)
a b c
0 2 1 1
1 3 4 2
2 5 7 4
3 6 9 5
But this looks a bit heavy-handed to me.
What's actually happening in the sort_values() example above and how can I sort all columns in my dataframe in a pandas-way without the lambda function?
You can use numpy.sort with the DataFrame constructor:
import numpy as np

df1 = pd.DataFrame(np.sort(df.values, axis=0), index=df.index, columns=df.columns)
print (df1)
a b c
0 2 1 1
1 3 4 2
2 5 7 4
3 6 9 5
EDIT:
Answer with descending order:
arr = df.values.copy()  # copy so the original df is not mutated in place
arr.sort(axis=0)
arr = arr[::-1]
print (arr)
[[6 9 5]
 [5 7 4]
 [3 4 2]
 [2 1 1]]
df1 = pd.DataFrame(arr, index=df.index, columns=df.columns)
print (df1)
a b c
0 6 9 5
1 5 7 4
2 3 4 2
3 2 1 1
sort_values will sort the entire data frame by the column order you pass to it. In your first example you are sorting the entire data frame by ['a', 'b', 'c']. This sorts first by 'a', then breaks ties by 'b', and finally by 'c'.
Notice how, after sorting by 'a', each row stays intact. This is the expected result.
Using lambda with apply, each column is passed to it separately, so sort_values operates on a single column at a time. That's why this second approach sorts the columns as you would expect; in this case, the rows change.
If you don't want to use lambda nor numpy you can get around using this:
pd.DataFrame({x: df[x].sort_values().values for x in df.columns.values})
Output:
a b c
0 2 1 1
1 3 4 2
2 5 7 4
3 6 9 5
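For descending order, the same comprehension works with ascending=False (a minimal variation on the snippet above):
pd.DataFrame({x: df[x].sort_values(ascending=False).values
              for x in df.columns.values})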

Assigning value to an observation from table of values

I have a large DataFrame of observations. i.e.
value 1,value 2
a,1
a,1
a,2
b,3
a,3
I now have an external DataFrame of values
_ ,a,b
1 ,10,20
2 ,30,40
3 ,50,60
What will be an efficient way to add to the first DataFrame the values from the indexed table? i.e.:
value 1,value 2, new value
a,1,10
a,1,10
a,2,30
b,3,60
a,3,50
An alternative solution uses .lookup(). It's a one-line, vectorized solution, suitable for large datasets.
import pandas as pd
import numpy as np
# generate some artificial data
# ================================
np.random.seed(0)
df1 = pd.DataFrame(dict(value1=np.random.choice('a b'.split(), 10), value2=np.random.randint(1, 10, 10)))
df2 = pd.DataFrame(dict(a=np.random.randn(10), b=np.random.randn(10)), columns=['a', 'b'], index=np.arange(1, 11))
df1
Out[178]:
value1 value2
0 a 6
1 b 3
2 b 5
3 a 8
4 b 7
5 b 9
6 b 9
7 b 2
8 b 7
9 b 8
df2
Out[179]:
a b
1 2.5452 0.0334
2 1.0808 0.6806
3 0.4843 -1.5635
4 0.5791 -0.5667
5 -0.1816 -0.2421
6 1.4102 1.5144
7 -0.3745 -0.3331
8 0.2752 0.0474
9 -0.9608 1.4627
10 0.3769 1.5350
# processing: one liner lookup function
# =======================================================
# df1.value2 is the index and df1.value1 is the column
df1['new_values'] = df2.lookup(df1.value2, df1.value1)
Out[181]:
value1 value2 new_values
0 a 6 1.4102
1 b 3 -1.5635
2 b 5 -0.2421
3 a 8 0.2752
4 b 7 -0.3331
5 b 9 1.4627
6 b 9 1.4627
7 b 2 0.6806
8 b 7 -0.3331
9 b 8 0.0474
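Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0; on newer versions, an equivalent (a sketch using the same df1/df2) is numpy fancy indexing over label positions:
# translate labels to integer positions, then index the underlying array
rows = df2.index.get_indexer(df1['value2'])
cols = df2.columns.get_indexer(df1['value1'])
df1['new_values'] = df2.to_numpy()[rows, cols]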
Assuming your first and second dfs are df and df1 respectively, you can merge on the matching columns and then mask the 'a' and 'b' conditions:
In [9]:
df = df.merge(df1, left_on=['value 2'], right_on=['_'])
a_mask = (df['value 2'] == df['_']) & (df['value 1'] == 'a')
b_mask = (df['value 2'] == df['_']) & (df['value 1'] == 'b')
df.loc[a_mask, 'new value'] = df['a'].where(a_mask)
df.loc[b_mask, 'new value'] = df['b'].where(b_mask)
df
Out[9]:
value 1 value 2 _ a b new value
0 a 1 1 10 20 10
1 a 1 1 10 20 10
2 a 2 2 30 40 30
3 b 3 3 50 60 60
4 a 3 3 50 60 50
You can then drop the additional columns:
In [11]:
df = df.drop(['_','a','b'], axis=1)
df
Out[11]:
value 1 value 2 new value
0 a 1 10
1 a 1 10
2 a 2 30
3 b 3 60
4 a 3 50
Another way is to define a func to perform the lookup:
In [15]:
def func(x):
    row = df1[df1['_'] == x['value 2']]
    return row[x['value 1']].values[0]

df['new value'] = df.apply(lambda x: func(x), axis=1)
df
Out[15]:
value 1 value 2 new value
0 a 1 10
1 a 1 10
2 a 2 30
3 b 3 60
4 a 3 50
EDIT
Using @Jianxun Li's lookup works, but you have to offset the index as your index is 0-based:
In [20]:
df['new value'] = df1.lookup(df['value 2'] - 1, df['value 1'])
df
Out[20]:
value 1 value 2 new value
0 a 1 10
1 a 1 10
2 a 2 30
3 b 3 60
4 a 3 50
