Reformatting values in a pandas row into associated columns - python

I have a dataframe which looks like:
ID1  ID2  Issues  Value1       Value2       IssueDate
1    1    1       56.85490855  9.489650847  02/12/2015
1    1    2       89.55441203  23.60227363  07/02/2015
1    2    1       21.8456428   23.37353082  01/10/2015
2    2    1       55.10795933  1.928443984  13/08/2015
2    2    2       10.22459873  24.44298882  07/04/2015
4    1    1       55.29748656  6.308424035  19/02/2015
and I want to turn it into multiple dataframes, one per value column (this one is for Value1; imagine a second for Value2), which look like:
Value 1
              2015_1       2015_2       2015_3       2015_4       2015_5       2015_6       2015_7       2015_8       2015_9      2015_10      2015_11      2015_12
ID1 ID2
1   1            NaN  89.55441203          NaN          NaN          NaN          NaN          NaN          NaN          NaN          NaN          NaN  56.85490855
1   2            NaN          NaN          NaN          NaN          NaN          NaN          NaN          NaN          NaN   21.8456428          NaN          NaN
2   2            NaN          NaN          NaN  10.22459873          NaN          NaN          NaN  55.10795933          NaN          NaN          NaN          NaN
4   1            NaN  55.29748656          NaN          NaN          NaN          NaN          NaN          NaN          NaN          NaN          NaN          NaN
The only way I can work out how to do this is to use a lambda function and add values in specific ranges to the associated columns. The problem is that my dataset is very large, and doing this row by row, looping over every possible month/year combination, would take a very long time.
Is there a clever way to use masks or melts to reformat the data into the tables I am looking for?

I guess you are looking for something like this:
df.IssueDate = pd.to_datetime(df.IssueDate, dayfirst=True)  # dates are DD/MM/YYYY
df['Date'] = df.IssueDate.dt.year.astype(str) + '_' + df.IssueDate.dt.month.astype(str)
pd.pivot_table(df[['ID1', 'ID2', 'Value1', 'Date']], columns='Date', index=['ID1', 'ID2'])
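If you want one table per value column, with the months in calendar order, a small extension of the same idea should work (a sketch; zero-padding the month keeps the string column labels sorted correctly):
df['IssueDate'] = pd.to_datetime(df['IssueDate'], dayfirst=True)
df['Date'] = df['IssueDate'].dt.strftime('%Y_%m')  # '2015_02' sorts before '2015_10'
pivots = {col: pd.pivot_table(df, values=col, columns='Date', index=['ID1', 'ID2'])
          for col in ['Value1', 'Value2']}
pivots['Value1']  # the Value1 table from the question; pivots['Value2'] is the second one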

Related

extract multiple sub-fields from Pandas dataframe column into a new dataframe

I have a Pandas dataframe (approx. 100k rows) as my input. It is an export from a database, and each field in one of the columns contains one or more sub-records which I need to expand into independent records. For example:
record_id  text_field
0          r0_sub_record1_field1|r0_sub_record1_field2#r0_sub_record2_field1|r0_sub_record2_field2#
1          sub_record1_field1|sub_record1_field2#
2          sub_record1_field1|sub_record1_field2#sub_record2_field1|sub_record2_field2#sub_record3_field1|sub_record3_field2#
The desired result should look like this:
record_id  field1                 field2                 original_record_id
0          r0_sub_record1_field1  r0_sub_record1_field2  0
1          r0_sub_record2_field1  r0_sub_record2_field2  0
2          r1_sub_record1_field1  r1_sub_record1_field2  1
3          r2_sub_record1_field1  r2_sub_record1_field2  2
4          r2_sub_record2_field1  r2_sub_record2_field2  2
5          r2_sub_record3_field1  r2_sub_record3_field2  2
It is quite straightforward to extract the data I need using a loop, but I suspect that is neither the most efficient nor the nicest way.
As I understand it, I cannot use apply or map here, because I am building another dataframe with the extracted data.
Is there a good Pythonic, pandas-style way to solve the problem?
I am using Python 3.7 and Pandas 1.2.1.
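For reference, a minimal construction of the sample frame (assuming the separators shown above: | between the two fields of a sub-record, # after each sub-record):
import pandas as pd

df = pd.DataFrame({
    'record_id': [0, 1, 2],
    'text_field': [
        'r0_sub_record1_field1|r0_sub_record1_field2#r0_sub_record2_field1|r0_sub_record2_field2#',
        'sub_record1_field1|sub_record1_field2#',
        'sub_record1_field1|sub_record1_field2#sub_record2_field1|sub_record2_field2#sub_record3_field1|sub_record3_field2#',
    ],
})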
I think you need to explode on the # record separator, then split each sub-record on |.
df1 = (df.assign(t=df['text_field'].str.split('#'))
         .drop(columns='text_field')
         .explode('t')
         .reset_index(drop=True))
df2 = df1.join(df1['t'].str.split('|', expand=True)).drop(columns='t')
print(df2.dropna())
record_id 0 1
0 0 r0_sub_record1_field1 r0_sub_record1_field2
1 0 r0_sub_record2_field1 r0_sub_record2_field2
3 1 sub_record1_field1 sub_record1_field2
5 2 sub_record1_field1 sub_record1_field2
6 2 sub_record2_field1 sub_record2_field2
7 2 sub_record3_field1 sub_record3_field2
Is this what you expect?
out = df['text_field'].str.strip('#').str.split('#').explode() \
        .str.split('|').apply(pd.Series)
prefix = 'r' + out.index.map(str) + '_'
out = out.apply(lambda v: prefix + v).reset_index() \
         .rename(columns={0: 'field1', 1: 'field2', 'index': 'original_record_id'})
>>> out
original_record_id field1 field2
0 0 r0_sub_record1_field1 r0_sub_record1_field2
1 0 r0_sub_record2_field1 r0_sub_record2_field2
2 1 r1_sub_record1_field1 r1_sub_record1_field2
3 2 r2_sub_record1_field1 r2_sub_record1_field2
4 2 r2_sub_record2_field1 r2_sub_record2_field2
5 2 r2_sub_record3_field1 r2_sub_record3_field2
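On ~100k rows, apply(pd.Series) can be slow; the same split also works with expand=True (a sketch under the same separator assumption, leaving out the prefix step):
# Split records on '#', explode to one sub-record per row, then split fields on '|'
parts = (df['text_field'].str.strip('#').str.split('#').explode()
                         .str.split('|', expand=True))
parts.columns = ['field1', 'field2']
parts = parts.rename_axis('original_record_id').reset_index()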

using pandas, extract data from long format df and add it to wide format df

I have two dataframes, df1 and df2. df1 has repeat observations arranged in wide format, and df2 in long format.
import pandas as pd

df1 = pd.DataFrame({"ID": [1, 2, 3],
                    "colA_1": [1, 2, 3],
                    "date1": ["1.1.2001", "2.1.2001", "3.1.2001"],
                    "colA_2": [4, 5, 6],
                    "date2": ["1.1.2002", "2.1.2002", "3.1.2002"]})
df2 = pd.DataFrame({"ID": [1, 1, 2, 2, 3, 3, 3],
                    "col1": [1, 1.5, 2, 2.5, 3, 3.5, 4],
                    "date": ["1.1.2001", "1.1.2002", "2.1.2001", "2.1.2002",
                             "3.1.2001", "3.1.2002", "4.1.2002"],
                    "col3": [11, 12, 13, 14, 15, 16, 17],
                    "col4": [21, 22, 23, 24, 25, 26, 27]})
df1 looks like:
ID colA_1 date1 colA_2 date2
0 1 1 1.1.2001 4 1.1.2002
1 2 2 2.1.2001 5 2.1.2002
2 3 3 3.1.2001 6 3.1.2002
df2 looks like:
ID col1 date col3 col4
0 1 1.0 1.1.2001 11 21
1 1 1.5 1.1.2002 12 22
2 2 2.0 2.1.2001 13 23
3 2 2.5 2.1.2002 14 24
4 3 3.0 3.1.2001 15 25
5 3 3.5 3.1.2002 16 26
6 3 4.0 4.1.2002 17 27
I want to take a given column from df2, "col3", and then:
(1) if the columns "ID" and "date" in df2 match with the columns "ID" and "date1" in df1, I want to put the value in a new column in df1 called "colB_1".
(2) else if the columns "ID" and "date" in df2 match with the columns "ID" and "date2" in df1, I want to put the value in a new column in df1 called "colB_2".
(3) else if the columns "ID" and "date" in df2 have no match with either ("ID" and "date1") or ("ID" and "date2"), I want to ignore these rows.
So the output dataframe, df3, should look like this:
ID colA_1 date1 colA_2 date2 colB_1 colB_2
0 1 1 1.1.2001 4 1.1.2002 11 12
1 2 2 2.1.2001 5 2.1.2002 13 14
2 3 3 3.1.2001 6 3.1.2002 15 16
What is the best way to do this?
I found this link, but the answer doesn't work for my case. I would like a really explicit way to specify column matching. I think it's possible that df.mask might be able to help me, but I am not sure how to implement it.
e.g.: the following code
df3 = df1.copy()
df3["colB_1"] = ""
df3["colB_2"] = ""
filter1 = (df1["ID"] == df2["ID"]) & (df1["date1"] == df2["date"])
filter2 = (df1["ID"] == df2["ID"]) & (df1["date2"] == df2["date"])
df3["colB_1"] = df.mask(filter1, other=df2["col3"])
df3["colB_2"] = df.mask(filter2, other=df2["col3"])
gives the error
ValueError: Can only compare identically-labeled Series objects
I asked this question previously and it was closed as a duplicate of this one. However, it is not a duplicate. The answers to the linked question suggest using either map or df.merge. map does not work with multiple conditions (in my case, ID and date). And df.merge (the answer given for matching on multiple columns) does not work when the columns to be merged on are named differently in df1 and df2 ("date" and "date1", for example).
For example, the below code:
df3 = df1.merge(df2[["ID","date","col3"]], on=['ID','date1'], how='left')
fails with a KeyError.
Also noteworthy is that I will be dealing with many different files, with many different column naming schemes, and I will need a different subset each time. This is why I would like an answer that explicitly names the columns and conditions.
Any help with this would be much appreciated.
You can use pd.wide_to_long after removing the underscores from the column names. This unpivots the dataframe, which you can then merge with df2 and pivot back using unstack:
m = df1.rename(columns=lambda x: x.replace('_', ''))
unpiv = pd.wide_to_long(m, ['colA', 'date'], 'ID', 'v').reset_index()
merge_piv = (unpiv.merge(df2[['ID', 'date', 'col3']], on=['ID', 'date'], how='left')
                  .set_index(['ID', 'v'])['col3'].unstack().add_prefix('colB_'))
final = df1.merge(merge_piv, left_on='ID', right_index=True)
print(final)
ID colA_1 date1 colA_2 date2 colB_1 colB_2
0 1 1 1.1.2001 4 1.1.2002 11 12
1 2 2 2.1.2001 5 2.1.2002 13 14
2 3 3 3.1.2001 6 3.1.2002 15 16
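If the main goal is explicit control over which columns match, two plain merges with left_on/right_on also work (a minimal sketch using the example's column names; left_on/right_on is exactly what avoids the KeyError above):
# Look up col3 by (ID, date1), then by (ID, date2); unmatched df2 rows drop out automatically
df3 = df1.merge(df2[['ID', 'date', 'col3']].rename(columns={'col3': 'colB_1'}),
                left_on=['ID', 'date1'], right_on=['ID', 'date'], how='left').drop(columns='date')
df3 = df3.merge(df2[['ID', 'date', 'col3']].rename(columns={'col3': 'colB_2'}),
                left_on=['ID', 'date2'], right_on=['ID', 'date'], how='left').drop(columns='date')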

python pandas - transforming table

I would like to transform a table similar to the one below:
X|Y|Z|
1|2|3|
3|5|2|
4|2|1|
The result I want to achieve should look like this:
col|1|2|3|4|5|
X  |1|0|1|1|0|
Y  |0|2|0|0|1|
Z  |1|1|1|0|0|
So, after the transformation, the new columns should be the unique values from the previous table, the cells should hold the count of each value's appearances in the original column, and the index should contain the old column names.
I got stuck and do not know how to handle this because I am a newbie in Python, so thanks in advance for the support.
Regards,
guddy_7
Use apply with value_counts, replace missing values with 0, and transpose with .T:
df = df.apply(pd.value_counts).fillna(0).astype(int).T
print (df)
1 2 3 4 5
X 1 0 1 1 0
Y 0 2 0 0 1
Z 1 1 1 0 0
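The same counts can also be computed with melt plus crosstab (a sketch on the toy data above):
m = df.melt(var_name='col')              # one row per (original column, value) pair
out = pd.crosstab(m['col'], m['value'])  # count occurrences; missing combinations become 0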

Pandas DataFrames: Extract Information and Collapse Columns

I have a pandas DataFrame which contains information in columns which I would like to extract into a new column.
It is best explained visually:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Number Type 1': [1, 2, np.nan],
                   'Number Type 2': [np.nan, 3, 4],
                   'Info': list('abc')})
The table shows the initial DataFrame with the Number Type 1 and Number Type 2 columns.
I would like to extract the types and create a new Type column, refactoring the DataFrame accordingly.
Basically, the numbers are collapsed into a single Number column, with the types extracted into a Type column. The information in the Info column stays bound to the numbers (e.g. 2 and 3 share the same info, b).
What is the best way to do this in Pandas?
Use melt with dropna:
df = df.melt('Info', value_name='Number', var_name='Type').dropna(subset=['Number'])
df['Type'] = df['Type'].str.extract(r'(\d+)')
df['Number'] = df['Number'].astype(int)
print (df)
Info Type Number
0 a 1 1
1 b 1 2
4 b 2 3
5 c 2 4
Another solution with set_index and stack:
df = df.set_index('Info').stack().rename_axis(('Info','Type')).reset_index(name='Number')
df['Type'] = df['Type'].str.extract(r'(\d+)')
df['Number'] = df['Number'].astype(int)
print (df)
Info Type Number
0 a 1 1
1 b 1 2
2 b 2 3
3 c 2 4
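pd.wide_to_long can also do this in one chain, assuming the columns are literally named 'Number Type 1' and 'Number Type 2' and that Info uniquely identifies the rows:
# Stub 'Number Type ' + numeric suffix -> long format, with the suffix in a 'Type' column
df2 = (pd.wide_to_long(df, stubnames='Number Type ', i='Info', j='Type')
         .dropna()
         .rename(columns={'Number Type ': 'Number'})
         .astype({'Number': int})
         .reset_index())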

convert columns values into rows with pandas

I have a pd.DataFrame in which I want to convert some columns into rows. In the example below I have two different samples, each with multiple target measurements. I want to break the targets ['t1', 't2', 't3'] out into new target rows, keeping the sample number. Is there a better way than a for-loop to convert a series of values (in columns) into rows?
# The input I have:
pd.DataFrame({'Sample': [0, 1],
              't1': [2, 3],
              't2': [4, 5],
              't3': [6, 7]})
# The output I'm expecting:
pd.DataFrame({'Sample': [0, 0, 0, 1, 1, 1],
              'targets': [2, 4, 6, 3, 5, 7]})
I don't think pd.pivot_table() can do that for me.
Does anyone have an idea ?
You are looking for melt:
pd.DataFrame({'Sample': [0, 1],
              't1': [2, 3],
              't2': [4, 5],
              't3': [6, 7]}).melt('Sample')
Out[74]:
Sample variable value
0 0 t1 2
1 1 t1 3
2 0 t2 4
3 1 t2 5
4 0 t3 6
5 1 t3 7
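To reproduce the exact expected frame (a single targets column, rows grouped by Sample), a small follow-up sketch, with df being the input frame above:
out = (df.melt('Sample', value_name='targets')
         .sort_values(['Sample', 'variable'])  # group rows by Sample, keep t1/t2/t3 order
         .drop(columns='variable')
         .reset_index(drop=True))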
