I am trying to convert consecutive columns to rows in pandas. The consecutive column names are strings with sequential numbers, i.e. Key1, Val1, ..., KeyN, ValN in the DataFrame. You can use the code below to generate the DataFrame.
df = pd.DataFrame({'City': ['Houston', 'Austin', 'Hoover'],
                   'State': ['Texas', 'Texas', 'Alabama'],
                   'Name': ['Aria', 'Penelope', 'Niko'],
                   'Key1': ['test1', 'test2', 'test3'], 'Val1': [28, 4, 7],
                   'Key2': ['test4', 'test5', 'test6'], 'Val2': [82, 45, 76],
                   'Key3': ['test7', 'test8', 'test9'], 'Val3': [4, 76, 9],
                   'Key4': ['test10', 'test11', 'test12'], 'Val4': [97, 66, 10],
                   'Key5': ['test13', 'test14', 'test15'], 'Val5': [4, 10, '']},
                  columns=['City', 'State', 'Name', 'Key1', 'Val1', 'Key2', 'Val2',
                           'Key3', 'Val3', 'Key4', 'Val4', 'Key5', 'Val5'])
I tried the melt function as below:
df.melt(id_vars=['City', 'State'], var_name='Column', value_name='Key')
The problem with that output is that every Key column and every Val column ends up in its own row; I need each KeyN/ValN pair kept together in a single row. The expected output is below:
Use pd.wide_to_long:
pd.wide_to_long(df,['Key', 'Val'],['City', 'State', 'Name'],'No').reset_index()
Output:
City State Name No Key Val
0 Houston Texas Aria 1 test1 28
1 Houston Texas Aria 2 test4 82
2 Houston Texas Aria 3 test7 4
3 Houston Texas Aria 4 test10 97
4 Houston Texas Aria 5 test13 4
5 Austin Texas Penelope 1 test2 4
6 Austin Texas Penelope 2 test5 45
7 Austin Texas Penelope 3 test8 76
8 Austin Texas Penelope 4 test11 66
9 Austin Texas Penelope 5 test14 10
10 Hoover Alabama Niko 1 test3 7
11 Hoover Alabama Niko 2 test6 76
12 Hoover Alabama Niko 3 test9 9
13 Hoover Alabama Niko 4 test12 10
14 Hoover Alabama Niko 5 test15
You are trying to simultaneously melt two columns. pd.wide_to_long handles this situation.
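As a minimal, self-contained sketch of how pd.wide_to_long matches the stubnames (using tiny made-up data, not the asker's frame):

```python
import pandas as pd

# Tiny frame with the same naming scheme: stub ('Key'/'Val') plus a number suffix.
df = pd.DataFrame({'Name': ['Aria', 'Niko'],
                   'Key1': ['a', 'b'], 'Val1': [1, 2],
                   'Key2': ['c', 'd'], 'Val2': [3, 4]})

# 'Key' and 'Val' are stubnames; the numeric suffix becomes the 'No' level.
long = pd.wide_to_long(df, stubnames=['Key', 'Val'], i='Name', j='No').reset_index()
```

Each Key/Val pair that shared a suffix in the wide frame now shares one row in the long frame.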
I have this data frame that I am transforming into a pivot table. I want to add concatenated columns as the values within the pivot.
import pandas as pd
import numpy as np
# creating a dataframe
df = pd.DataFrame({'Student': ['John', 'Boby', 'Mina', 'Peter', 'Nicky'],
'Grade': ['Masters', 'Graduate', 'Graduate', 'Masters', 'Graduate'],
'Major': ['Liberal Arts', 'Business', 'Sciences', 'Education', 'Law'],
'Age': [27, 23, 21, 23, 24],
'City': ['Boston', 'Brooklyn', 'Camden', 'Chicago', 'Manhattan'],
'State': ['MA', 'NY', 'NJ', 'IL', 'NY'],
'Years' : [2, 4, 3, 3, 4] })
Displays this table
Student Grade Major Age City State Years
0 John Masters Liberal Arts 27 Boston MA 2
1 Boby Graduate Business 23 Brooklyn NY 4
2 Mina Graduate Sciences 21 Camden NJ 3
3 Peter Masters Education 23 Chicago IL 3
4 Nicky Graduate Law 24 Manhattan NY 4
Concatenated Columns
values = pd.concat([df['Age'],df['Years']], axis=1, ignore_index=True)
Displays this result
0 1
0 27 2
1 23 4
2 21 3
3 23 3
4 24 4
I want to add the concatenated columns (values) inside of the pivot table so the table displays Age and Years in adjacent columns, not as separate pivot tables:
table = pd.pivot_table(df, values=['Age', 'Years'],
                       index=['Student', 'City', 'State'],
                       columns=['Grade', 'Major'], aggfunc=np.sum)
Grade Graduate Masters
Major Business Law Sciences Education Liberal Arts
Student City State
Boby Brooklyn NY 23.0 NaN NaN NaN NaN
John Boston MA NaN NaN NaN NaN 27.0
Mina Camden NJ NaN NaN 21.0 NaN NaN
Nicky Manhattan NY NaN 24.0 NaN NaN NaN
Peter Chicago IL NaN NaN NaN 23.0 NaN
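One sketch that puts Age and Years in adjacent columns within each Grade/Major group (an assumption about the desired layout, not a confirmed answer) is to move the values level of the column MultiIndex to the innermost position and re-sort:

```python
import pandas as pd

df = pd.DataFrame({'Student': ['John', 'Boby', 'Mina', 'Peter', 'Nicky'],
                   'Grade': ['Masters', 'Graduate', 'Graduate', 'Masters', 'Graduate'],
                   'Major': ['Liberal Arts', 'Business', 'Sciences', 'Education', 'Law'],
                   'Age': [27, 23, 21, 23, 24],
                   'City': ['Boston', 'Brooklyn', 'Camden', 'Chicago', 'Manhattan'],
                   'State': ['MA', 'NY', 'NJ', 'IL', 'NY'],
                   'Years': [2, 4, 3, 3, 4]})

table = pd.pivot_table(df, values=['Age', 'Years'],
                       index=['Student', 'City', 'State'],
                       columns=['Grade', 'Major'], aggfunc='sum')

# Columns come out as (value, Grade, Major); reorder to (Grade, Major, value)
# so Age and Years sit next to each other under every Grade/Major pair.
table = table.reorder_levels([1, 2, 0], axis=1).sort_index(axis=1)
```

After the reorder, each Grade/Major pair carries its own adjacent Age and Years columns in a single pivot table.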
I have a dataframe similar to this one:
name hobby date country 5 10 15 20 ...
Toby Guitar 2020-01-19 Brazil 0.1245 0.2543 0.7763 0.2264
Linda Cooking 2020-03-05 Italy 0.5411 0.2213 NaN 0.3342
Ben Diving 2020-04-02 USA 0.8843 0.2333 0.4486 0.2122
...
I want to take the int columns, duplicate them, and put the int as the value of the new column, something like this:
name hobby date country 5 5 10 10 15 15 20 20...
Toby Guitar 2020-01-19 Brazil 0.1245 5 0.2543 10 0.7763 15 0.2264 20
Linda Cooking 2020-03-05 Italy 0.5411 5 0.2213 10 NaN 15 0.3342 20
Ben Diving 2020-04-02 USA 0.8843 5 0.2333 10 0.4486 15 0.2122 20
...
I'm not sure how to tackle this and looking for ideas
Here is a solution you can try out:
digits_ = pd.DataFrame(
{col: [int(col)] * len(df) for col in df.columns if col.isdigit()}
)
pd.concat([df, digits_], axis=1)
name hobby date country 5 ... 20 5 10 15 20
0 Toby Guitar 2020-01-19 Brazil 0.1245 ... 0.2264 5 10 15 20
1 Linda Cooking 2020-03-05 Italy 0.5411 ... 0.3342 5 10 15 20
2 Ben Diving 2020-04-02 USA 0.8843 ... 0.2122 5 10 15 20
I'm not sure if it is the best way to organise data with duplicated column names. I would recommend stacking (melting) it into long format.
df.melt(id_vars=["name", "hobby", "date", "country"])
Result
name hobby date country variable value
0 Toby Guitar 2020-01-19 Brazil 5 0.1245
1 Linda Cooking 2020-03-05 Italy 5 0.5411
2 Ben Diving 2020-04-02 USA 5 0.8843
3 Toby Guitar 2020-01-19 Brazil 10 0.2543
4 Linda Cooking 2020-03-05 Italy 10 0.2213
5 Ben Diving 2020-04-02 USA 10 0.2333
6 Toby Guitar 2020-01-19 Brazil 15 0.7763
7 Linda Cooking 2020-03-05 Italy 15 NaN
8 Ben Diving 2020-04-02 USA 15 0.4486
9 Toby Guitar 2020-01-19 Brazil 20 0.2264
10 Linda Cooking 2020-03-05 Italy 20 0.3342
11 Ben Diving 2020-04-02 USA 20 0.2122
You could use the pandas insert(...) function combined with a for loop:
import numpy as np
import pandas as pd
df = pd.DataFrame([['Toby', 'Guitar', '2020-01-19', 'Brazil', 0.1245, 0.2543, 0.7763, 0.2264],
                   ['Linda', 'Cooking', '2020-03-05', 'Italy', 0.5411, 0.2213, np.nan, 0.3342],
                   ['Ben', 'Diving', '2020-04-02', 'USA', 0.8843, 0.2333, 0.4486, 0.2122]],
                  columns=['name', 'hobby', 'date', 'country', 5, 10, 15, 20])
start_col = 4
for i in range(len(df.columns) - start_col):
    dcol = df.columns[start_col + i*2]  # digit column name to duplicate
    df.insert(start_col + i*2 + 1, dcol, [dcol] * len(df.index), True)
results:
    name    hobby        date country       5  ...  10      15  15      20  20
0   Toby   Guitar  2020-01-19  Brazil  0.1245  ...  10  0.7763  15  0.2264  20
1  Linda  Cooking  2020-03-05   Italy  0.5411  ...  10     NaN  15  0.3342  20
2    Ben   Diving  2020-04-02     USA  0.8843  ...  10  0.4486  15  0.2122  20
[3 rows x 12 columns]
I assumed that all your columns from the 5th onward are digits; if not, you could add an if condition in the for loop to guard against that:
start_col = 4
for i in range(len(df.columns) - start_col):
    dcol = df.columns[start_col + i*2]  # digit column name to duplicate
    if type(dcol) is int:
        df.insert(start_col + i*2 + 1, dcol, [dcol] * len(df.index), True)
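A sketch of the same interleaving without repeated insert calls (assuming the value columns are exactly the integers 5, 10, 15, 20): concatenate each value column with a constant column of the same name, in order.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([['Toby', 'Guitar', '2020-01-19', 'Brazil', 0.1245, 0.2543, 0.7763, 0.2264],
                   ['Linda', 'Cooking', '2020-03-05', 'Italy', 0.5411, 0.2213, np.nan, 0.3342],
                   ['Ben', 'Diving', '2020-04-02', 'USA', 0.8843, 0.2333, 0.4486, 0.2122]],
                  columns=['name', 'hobby', 'date', 'country', 5, 10, 15, 20])

parts = [df[['name', 'hobby', 'date', 'country']]]
for col in [5, 10, 15, 20]:
    parts.append(df[[col]])  # the original value column
    parts.append(pd.Series(col, index=df.index, name=col).to_frame())  # constant copy
out = pd.concat(parts, axis=1)
```

This leaves the original df untouched and builds the duplicated layout in one concat.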
Hope you can help me with this. The df looks like this:
region AMER
country Brazil Canada Columbia Mexico United States
metro Rio de Janeiro Sao Paulo Toronto Bogota Mexico City Monterrey Atlanta Boston Chicago Culpeper Dallas Denver Houston Los Angeles Miami New York Philadelphia Seattle Silicon Valley Washington D.C.
ID
321321 2 1 1 13 15 29 1 2 1 11 6 15 3 2 14 3
23213 3
231 2 2 3 1 5 6 3 3 4 3 3 4
23213 4 1 1 1 4 1 2 27 1
21321 4 2 2 1 14 3 2 4 2
12321 1 2 1 1 1 1 10
123213 2 45 5 1
12321 1
123 1 3 2
I want to get the count of columns that have data, per metro and country per region, for each of the rows (ID/index), and store that count in a new column.
Regards,
RJ
You may want to try
df['new'] = df.sum(level=0, axis=1)
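If the goal is counting non-empty metro cells rather than summing them, a sketch along these lines may help (assuming region/country/metro form a column MultiIndex, as the display suggests; the data here is a small hypothetical stand-in, not the asker's frame):

```python
import numpy as np
import pandas as pd

cols = pd.MultiIndex.from_tuples(
    [('AMER', 'Brazil', 'Rio de Janeiro'),
     ('AMER', 'Brazil', 'Sao Paulo'),
     ('AMER', 'Canada', 'Toronto')],
    names=['region', 'country', 'metro'])
df = pd.DataFrame([[2.0, 1.0, np.nan], [np.nan, 3.0, np.nan]],
                  index=pd.Index([321321, 23213], name='ID'), columns=cols)

# Count of metro columns with data, per row (ID).
metro_counts = df.notna().sum(axis=1)

# The same count broken down per country.
country_counts = df.notna().T.groupby(level='country').sum().T
```

metro_counts can then be assigned as the new column, and country_counts gives one count column per country.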
I've combined two DataFrames into one but can't figure out how to relabel "state_x" and "state_y" to "West Coast" and "East Coast". I will be plotting them later.
What I have so far:
West_quakes = pd.DataFrame({'state': ['California', 'Oregon', 'Washington', 'Alaska'],
'Occurrences': [18108, 376, 973, 12326]})
East_quakes = pd.DataFrame({'state': ['Maine', 'New Hampshire', 'Massachusetts',
'Connecticut', 'New York', 'New Jersey', 'Pennsylvania', 'Maryland',
'Virginia', 'North Carolina', 'South Carolina', 'Georgia', 'Florida'],
'Occurrences': [36, 13, 10, 5, 35, 10, 14, 2, 28, 17, 32, 14, 1]})
West_quakes.reset_index(drop=True).merge(East_quakes.reset_index(drop=True), left_index=True, right_index=True)
Output:
state_x Occurrences_x state_y Occurrences_y
0 California 18108 Maine 36
1 Oregon 376 New Hampshire 13
2 Washington 973 Massachusetts 10
3 Alaska 12326 Connecticut 5
Other merging methods I've tried result in errors, such as:
West_quake.set_index('West Coast', inplace=True)
East_quake.set_index('East Coast', inplace=True)
I'm really lost after searching on Google and searching on here.
Any help would be greatly appreciated.
Thank you.
Maybe you are looking for concat instead:
pd.concat((West_quakes, East_quakes))
gives:
state Occurrences
0 California 18108
1 Oregon 376
2 Washington 973
3 Alaska 12326
0 Maine 36
1 New Hampshire 13
2 Massachusetts 10
3 Connecticut 5
4 New York 35
5 New Jersey 10
6 Pennsylvania 14
7 Maryland 2
8 Virginia 28
9 North Carolina 17
10 South Carolina 32
11 Georgia 14
12 Florida 1
Or:
pd.concat((West_quakes, East_quakes), keys=('West','East'))
which gives:
state Occurrences
West 0 California 18108
1 Oregon 376
2 Washington 973
3 Alaska 12326
East 0 Maine 36
1 New Hampshire 13
2 Massachusetts 10
3 Connecticut 5
4 New York 35
5 New Jersey 10
6 Pennsylvania 14
7 Maryland 2
8 Virginia 28
9 North Carolina 17
10 South Carolina 32
11 Georgia 14
12 Florida 1
Or:
pd.concat((West_quakes, East_quakes), axis=1, keys=('West','East'))
outputs:
West East
state Occurrences state Occurrences
0 California 18108.0 Maine 36
1 Oregon 376.0 New Hampshire 13
2 Washington 973.0 Massachusetts 10
3 Alaska 12326.0 Connecticut 5
4 NaN NaN New York 35
5 NaN NaN New Jersey 10
6 NaN NaN Pennsylvania 14
7 NaN NaN Maryland 2
8 NaN NaN Virginia 28
9 NaN NaN North Carolina 17
10 NaN NaN South Carolina 32
11 NaN NaN Georgia 14
12 NaN NaN Florida 1
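Since the goal is plotting by coast, another sketch: use keys when concatenating and turn that label into a plain Coast column (the name 'Coast' is my own, not from the original post):

```python
import pandas as pd

West_quakes = pd.DataFrame({'state': ['California', 'Oregon', 'Washington', 'Alaska'],
                            'Occurrences': [18108, 376, 973, 12326]})
East_quakes = pd.DataFrame({'state': ['Maine', 'New Hampshire', 'Massachusetts',
                                      'Connecticut', 'New York', 'New Jersey',
                                      'Pennsylvania', 'Maryland', 'Virginia',
                                      'North Carolina', 'South Carolina', 'Georgia',
                                      'Florida'],
                            'Occurrences': [36, 13, 10, 5, 35, 10, 14, 2, 28, 17, 32, 14, 1]})

# keys label each frame; rename_axis names that index level so it can
# be reset into an ordinary column.
quakes = (pd.concat([West_quakes, East_quakes], keys=['West Coast', 'East Coast'])
            .rename_axis(['Coast', None])
            .reset_index(level='Coast'))
```

quakes then has a plain 'Coast' column, which is convenient for e.g. quakes.groupby('Coast')['Occurrences'].sum().plot(kind='bar').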
Having grouped data, I want to drop from the results groups that contain only a single observation with the value below a certain threshold.
Initial data:
df = pd.DataFrame(data={'Province' : ['ON','QC','BC','AL','AL','MN','ON'],
'City' :['Toronto','Montreal','Vancouver','Calgary','Edmonton','Winnipeg','Windsor'],
'Sales' : [13,6,16,8,4,3,1]})
City Province Sales
0 Toronto ON 13
1 Montreal QC 6
2 Vancouver BC 16
3 Calgary AL 8
4 Edmonton AL 4
5 Winnipeg MN 3
6 Windsor ON 1
Now grouping the data:
df.groupby(['Province', 'City']).sum()
Sales
Province City
AL Calgary 8
Edmonton 4
BC Vancouver 16
MN Winnipeg 3
ON Toronto 13
Windsor 1
QC Montreal 6
Now the part I can't figure out is how to drop provinces with only one city (or generally N observations) with the total sales less than 10. The expected output should be:
Sales
Province City
AL Calgary 8
Edmonton 4
BC Vancouver 16
ON Toronto 13
Windsor 1
I.e. MN/Winnipeg and QC/Montreal are gone from the results. Ideally, they won't be completely gone but combined into a new group called 'Other', but this may be material for another question.
You can do it this way:
In [188]: df
Out[188]:
City Province Sales
0 Toronto ON 13
1 Montreal QC 6
2 Vancouver BC 16
3 Calgary AL 8
4 Edmonton AL 4
5 Winnipeg MN 3
6 Windsor ON 1
In [189]: g = df.groupby(['Province', 'City']).sum().reset_index()
In [190]: g
Out[190]:
Province City Sales
0 AL Calgary 8
1 AL Edmonton 4
2 BC Vancouver 16
3 MN Winnipeg 3
4 ON Toronto 13
5 ON Windsor 1
6 QC Montreal 6
Now we will create a mask for those 'provinces with more than one city':
In [191]: mask = g.groupby('Province').City.transform('count') > 1
In [192]: mask
Out[192]:
0 True
1 True
2 False
3 False
4 True
5 True
6 False
dtype: bool
Then keep the rows that match the mask, or whose total sales are greater than or equal to 10:
In [193]: g[(mask) | (g.Sales >= 10)]
Out[193]:
Province City Sales
0 AL Calgary 8
1 AL Edmonton 4
2 BC Vancouver 16
4 ON Toronto 13
5 ON Windsor 1
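For the 'Other' bucket the question mentions in passing, a sketch building on the same mask idea: relabel the losing provinces instead of dropping them (threshold and names as in the question):

```python
import pandas as pd

df = pd.DataFrame({'Province': ['ON', 'QC', 'BC', 'AL', 'AL', 'MN', 'ON'],
                   'City': ['Toronto', 'Montreal', 'Vancouver', 'Calgary',
                            'Edmonton', 'Winnipeg', 'Windsor'],
                   'Sales': [13, 6, 16, 8, 4, 3, 1]})

g = df.groupby(['Province', 'City'], as_index=False)['Sales'].sum()
keep = (g.groupby('Province')['City'].transform('count') > 1) | (g['Sales'] >= 10)

# Relabel the losing groups as 'Other' instead of filtering them out.
g.loc[~keep, 'Province'] = 'Other'
result = g.groupby(['Province', 'City']).sum()
```

MN/Winnipeg and QC/Montreal then survive under a combined 'Other' province instead of disappearing.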
I wasn't satisfied with any of the answers given, so I kept chipping away at this until I figured out the following solution:
In [72]: df
Out[72]:
City Province Sales
0 Toronto ON 13
1 Montreal QC 6
2 Vancouver BC 16
3 Calgary AL 8
4 Edmonton AL 4
5 Winnipeg MN 3
6 Windsor ON 1
In [73]: df.groupby(['Province', 'City']).sum().groupby(level=0).filter(lambda x: len(x) > 1 or x.Sales.sum() > 10)
Out[73]:
Sales
Province City
AL Calgary 8
Edmonton 4
BC Vancouver 16
ON Toronto 13
Windsor 1