How can I create a loop to merge two DataFrames? - python

I have two DataFrames:
name
age
weight
sex
d_type
john
21
56
M
futboll
martha
25
43
F
soccer
esthela
29
53
F
judo
harry
18
72
M
karate
irving
24
61
M
karate
jerry
21
56
M
soccer
john_2
26
69
M
futboll
malina
22
53
F
soccer
And
d_type
impact
founds_in
futboll
high
federal
soccer
medium
state
judo
medium
federal
karate
high
federal
At the end I want a DF like this.
name
age
weight
sex
d_type
impact
founds_in
john
21
56
M
futboll
high
federal
martha
25
43
F
soccer
medium
state
esthela
29
53
F
judo
medium
federal
harry
18
72
M
karate
high
federal
irving
24
61
M
karate
high
federal
jerry
21
56
M
soccer
medium
state
john_2
26
69
M
futboll
high
federal
malina
22
53
F
soccer
medium
state
How can I do this in pandas? I need a loop or it's better try in Linux?

In Python
df1 = pd.DataFrame({'name': ['john', 'martha'],
'age': [21, 25],
'd_type': ['futbol', 'soccer']})
df2 = pd.DataFrame({'d_type': ['futbol', 'soccer'],
'impact': ['high', 'medium'],
'founds_in': ['federal', 'state']})
df1.merge(df2, on = 'd_type').set_index('name')
which gives:
name
age
d_type
impact
founds_in
john
21
futbol
high
federal
martha
25
soccer
medium
state

Related

pandas update specific rows in specific columns in one dataframe based on another dataframe

I have two dataframes, Big and Small, and I want to update Big based on the data in Small, only in specific columns.
this is Big:
>>> ID name country city hobby age
0 12 Meli Peru Lima eating 212
1 15 Saya USA new-york drinking 34
2 34 Aitel Jordan Amman riding 51
3 23 Tanya Russia Moscow sports 75
4 44 Gil Spain Madrid paella 743
and this is small:
>>>ID name country city hobby age
0 12 Melinda Peru Lima eating 24
4 44 Gil Spain Barcelona friends 21
I would like to update the rows in Big based on info from Small, on the ID number. I would also like to change only specific columns, the age and the city, and not the name /country/city....
so the result table should look like this:
>>> ID name country city hobby age
0 12 Meli Peru Lima eating *24*
1 15 Saya USA new-york drinking 34
2 34 Aitel Jordan Amman riding 51
3 23 Tanya Russia Moscow sports 75
4 44 Gil Spain *Barcelona* paella *21*
I know to us eupdate but in this case I don't want to change all the the columns in each row, but only specific ones. Is there way to do that?
Use DataFrame.update by ID converted to index and selecting columns for processing - here only age and city:
df11 = df1.set_index('ID')
df22 = df2.set_index('ID')[['age','city']]
df11.update(df22)
df = df11.reset_index()
print (df)
ID name country city hobby age
0 12 Meli Peru Lima eating 24.0
1 15 Saya USA new-york drinking 34.0
2 34 Aitel Jordan Amman riding 51.0
3 23 Tanya Russia Moscow sports 75.0
4 44 Gil Spain Barcelona paella 21.0

Pandas How to combine two rows in group with complex rules/conditions

I have a dataframe:
import pandas as pd
df = pd.DataFrame({
"ID": ['company A', 'company A', 'company A', 'company B','company B', 'company B', 'company C', 'company C','company C','company C', 'company D', 'company D','company D'],
'Sender': [28, 'remove1', 'flag_source', 56, 28, 312, 'remove2', 'flag_source', 78, 102, 26, 101, 96],
'Receiver': [129, 28, 'remove1', 172, 56, 28, 61, 'remove2', 12, 78, 98, 26, 101],
'Date': ['2020-04-12', '2020-03-20', '2020-03-20', '2019-02-11', '2019-01-31', '2018-04-02', '2020-06-29', '2020-06-29', '2019-11-29', '2019-10-01', '2020-04-03', '2020-01-30', '2019-10-18'],
'Sender_type': ['house', 'temp', 'house', 'house', 'house', 'house', 'temp', 'house', 'house','house','house', 'temp', 'house'],
'Receiver_type': ['house', 'house', 'temp', 'house','house','house','house', 'temp', 'house','house','house','house','temp'],
'Price': [32, 50, 47, 21, 23, 19, 52, 39, 12, 22, 61, 53, 19]
})
The df is like this below:
ID Sender Receiver Date Sender_type Receiver_type Price
0 company A 28 129 2020-04-12 house house 32
1 company A remove1 28 2020-03-20 temp house 50 # combine this row with below
2 company A flag_source remove1 2020-03-20 house temp 47 # combine this row with above
3 company B 56 172 2019-02-11 house house 21
4 company B 28 56 2019-01-31 house house 23
5 company B 312 28 2018-04-02 house house 19
6 company C remove2 61 2020-06-29 temp house 52 # combine this row and below
7 company C flag_source remove2 2020-06-29 house temp 39 # combine this row with above
8 company C 78 12 2019-11-29 house house 12
9 company C 102 78 2019-10-01 house house 22
10 company D 26 98 2020-04-03 house house 61
11 company D 101 26 2020-01-30 temp house 53
12 company D 96 101 2019-10-18 house temp 19
I wish to combine/merge two rows for each group 'ID' (company x) by the following rule: combine the row in 'Sender' that contains a'flag_source' and its above row into one new row. In this new row: the Sender is the flag_source, 'Revceiver' is its above value (remove the two 'remove' values), Date is the above date, Sender_type and Receiver_type are 'house', and 'Price' is the previous above value. Then remove the two rows. For example, for company A, it will combine line 1 and line 2 to generate the new row below:
ID Sender Receiver Date Sender_type Receiver_type Price
company A flag_source 28 2020-03-20 house house 50
Then use this new row to replace the previous two lines. Same rules for the other groups(in this case only apply to company A and C). In the end, I wish to have a result like this:
ID Sender Receiver Date Sender_type Receiver_type Price
0 company A 28 129 2020-04-12 house house 32
1 company A flag_source 28 2020-03-20 house house 50 # new row
2 company B 56 172 2019-02-11 house house 21
3 company B 28 56 2019-01-31 house house 23
4 company B 312 28 2018-04-02 house house 19
5 company C flag_source 61 2020-06-29 house house 52 # new row
6 company C 78 12 2019-11-29 house house 12
7 company C 102 78 2019-10-01 house house 22
8 company D 26 98 2020-04-03 house house 61
9 company D 101 26 2020-01-30 temp house 53
10 company D 96 101 2019-10-18 house temp 19
Hopefully my explanation for the question is clear.
As this is a brief sample, the real case has many data like this, I wrote a loop but very slow and unproductive, so please help if you have any ideas and effective way. Many many thanks for help!
I believe the following is working:
mask = df.Sender == 'flag_source'
df[mask] = df.shift()
df.loc[mask, 'Sender'] = 'flag_source'
df.loc[mask, ['Sender_type','Receiver_type']] = 'house'
df = df[~mask.shift(-1).fillna(False).astype(bool)].reset_index(drop=True)
So the steps are (by line):
make a mask of the rows you need to channge
set those rows equal to the previous row with 'shift'
rewrite Sender for those rows to flag_source
also rewrite Sender_type and Receiver_type
remove the previous rows, by using a shift again on the mask. This seems a little convoluted; you could also do something like a loc against rows that don't contain the string remove
Output:
ID Sender Receiver Date Sender_type Receiver_type Price
0 company A 28 129 2020-04-12 house house 32.0
1 company A flag_source 28 2020-03-20 house house 50.0
2 company B 56 172 2019-02-11 house house 21.0
3 company B 28 56 2019-01-31 house house 23.0
4 company B 312 28 2018-04-02 house house 19.0
5 company C flag_source 61 2020-06-29 house house 52.0
6 company C 78 12 2019-11-29 house house 12.0
7 company C 102 78 2019-10-01 house house 22.0
8 company D 26 98 2020-04-03 house house 61.0
9 company D 101 26 2020-01-30 temp house 53.0
10 company D 96 101 2019-10-18 house temp 19.0

How to change value of a pd.DataFrame based on a condition?

I have Fifa dataset and it includes information about football players. One of the features of this dataset is the value of football players but it is in string form such as "$300K" or "$50M". How can I delete simply these euro and "M, K" symbol and write their values in same units?
import numpy as np
import pandas as pd
location = r'C:\Users\bemrem\Desktop\Python\fifa\fifa_dataset.csv'
_dataframe = pd.read_csv(location)
_dataframe = _dataframe.dropna()
_dataframe = _dataframe.reset_index(drop=True)
_dataframe = _dataframe[['Name', 'Value', 'Nationality', 'Age', 'Wage',
'Overall', 'Potential']]
_array = ['Belgium', 'France', 'Brazil', 'Croatia', 'England',' Portugal',
'Uruguay', 'Switzerland', 'Spain', 'Denmark']
_dataframe = _dataframe.loc[_dataframe['Nationality'].isin(_array)]
_dataframe = _dataframe.reset_index(drop=True)
print(_dataframe.head())
print()
print(_dataframe.tail())
I tried to convert this Value column but I failed. This is what I get
Name Value Nationality Age Wage Overall Potential
0 Neymar €123M Brazil 25 €280K 92 94
1 L. Suárez €97M Uruguay 30 €510K 92 92
2 E. Hazard €90.5M Belgium 26 €295K 90 91
3 Sergio Ramos €52M Spain 31 €310K 90 90
4 K. De Bruyne €83M Belgium 26 €285K 89 92
Name Value Nationality Age Wage Overall Potential
4931 A. Kilgour €40K England 19 €1K 47 56
4932 R. White €60K England 18 €2K 47 65
4933 T. Sawyer €50K England 18 €1K 46 58
4934 J. Keeble €40K England 18 €1K 46 56
4935 J. Lundstram €60K England 18 €1K 46 64
But I want to my output looks like this:
Name Value Nationality Age Wage Overall Potential
0 Neymar 123 Brazil 25 €280K 92 94
1 L. Suárez 97 Uruguay 30 €510K 92 92
2 E. Hazard 90.5 Belgium 26 €295K 90 91
3 Sergio Ramos 52 Spain 31 €310K 90 90
4 K. De Bruyne 83 Belgium 26 €285K 89 92
Name Value Nationality Age Wage Overall Potential
4931 A. Kilgour 0.04 England 19 €1K 47 56
4932 R. White 0.06 England 18 €2K 47 65
4933 T. Sawyer 0.05 England 18 €1K 46 58
4934 J. Keeble 0.04 England 18 €1K 46 56
4935 J. Lundstram 0.06 England 18 €1K 46 64
I do not have enough reputation to flag an answer as a duplicate. However, I believe that this will solve your particular question in addition to providing a solution if there is no "K" or "M" in your string.
You will also need to replace $ with € in the regex.
Convert the string 2.90K to 2900 or 5.2M to 5200000 in pandas dataframe

Pandas: transform column's values in independent columns

I have Pandas DataFrame which looks like following (df_olymic).
I would like the values of column Type to be transformed in independent columns (df_olympic_table)
Original dataframe
In [3]: df_olympic
Out[3]:
Country Type Num
0 USA Gold 46
1 USA Silver 37
2 USA Bronze 38
3 GB Gold 27
4 GB Silver 23
5 GB Bronze 17
6 China Gold 26
7 China Silver 18
8 China Bronze 26
9 Russia Gold 19
10 Russia Silver 18
11 Russia Bronze 19
Transformed dataframe
In [5]: df_olympic_table
Out[5]:
Country N_Gold N_Silver N_Bronze
0 USA 46 37 38
1 GB 27 23 17
2 China 26 18 26
3 Russia 19 18 19
What would be the most convenient way to achieve this?
You can use DataFrame.pivot:
df = df.pivot(index='Country', columns='Type', values='Num')
print (df)
Type Bronze Gold Silver
Country
China 26 26 18
GB 17 27 23
Russia 19 19 18
USA 38 46 37
Another solution with DataFrame.set_index and Series.unstack:
df = df.set_index(['Country','Type'])['Num'].unstack()
print (df)
Type Bronze Gold Silver
Country
China 26 26 18
GB 17 27 23
Russia 19 19 18
USA 38 46 37
but if get:
ValueError: Index contains duplicate entries, cannot reshape
need pivot_table with some aggreagte function, by default it is np.mean, but you can use sum, first...
#add new row with duplicates value in 'Country' and 'Type'
print (df)
Country Type Num
0 USA Gold 46
1 USA Silver 37
2 USA Bronze 38
3 GB Gold 27
4 GB Silver 23
5 GB Bronze 17
6 China Gold 26
7 China Silver 18
8 China Bronze 26
9 Russia Gold 19
10 Russia Silver 18
11 Russia Bronze 20 < - changed value to 20
11 Russia Bronze 100 < - add new row with duplicates
df = df.pivot_table(index='Country', columns='Type', values='Num', aggfunc=np.mean)
print (df)
Type Bronze Gold Silver
Country
China 26 26 18
GB 17 27 23
Russia 60 19 18 < - Russia get ((100 + 20)/ 2 = 60
USA 38 46 37
Or groupby with aggreagting mean and reshape by unstack:
df = df.groupby(['Country','Type'])['Num'].mean().unstack()
print (df)
Type Bronze Gold Silver
Country
China 26 26 18
GB 17 27 23
Russia 60 19 18 < - Russia get ((100 + 20)/ 2 = 60
USA 38 46 37

Count number of counties per state using python {census}

I am troubling with counting the number of counties using famous cenus.csv data.
Task: Count number of counties in each state.
Facing comparing (I think) / Please read below?
I've tried this:
df = pd.read_csv('census.csv')
dfd = df[:]['STNAME'].unique() //Gives out names of state
serr = pd.Series(dfd) // converting to series (from array)
After this, i've tried using two approaches:
1:
df[df['STNAME'] == serr] **//ERROR: series length must match**
2:
i = 0
for name in serr: //This generate error 'Alabama'
df['STNAME'] == name
for i in serr:
serr[i] == serr[name]
print(serr[name].count)
i+=1
Please guide me; it has been three days with this stuff.
Use groupby and aggregate COUNTY using nunique:
In [1]: import pandas as pd
In [2]: df = pd.read_csv('census.csv')
In [3]: unique_counties = df.groupby('STNAME')['COUNTY'].nunique()
Now the results
In [4]: unique_counties
Out[4]:
STNAME
Alabama 68
Alaska 30
Arizona 16
Arkansas 76
California 59
Colorado 65
Connecticut 9
Delaware 4
District of Columbia 2
Florida 68
Georgia 160
Hawaii 6
Idaho 45
Illinois 103
Indiana 93
Iowa 100
Kansas 106
Kentucky 121
Louisiana 65
Maine 17
Maryland 25
Massachusetts 15
Michigan 84
Minnesota 88
Mississippi 83
Missouri 116
Montana 57
Nebraska 94
Nevada 18
New Hampshire 11
New Jersey 22
New Mexico 34
New York 63
North Carolina 101
North Dakota 54
Ohio 89
Oklahoma 78
Oregon 37
Pennsylvania 68
Rhode Island 6
South Carolina 47
South Dakota 67
Tennessee 96
Texas 255
Utah 30
Vermont 15
Virginia 134
Washington 40
West Virginia 56
Wisconsin 73
Wyoming 24
Name: COUNTY, dtype: int64
juanpa.arrivillaga has a great solution. However, the code needs a minor modification.
The "counties" with 'SUMLEV' == 40 or 'COUNTY' == 0 should be filtered. Otherwise, all the number of counties are too big by one.
So, the correct answer should be:
unique_counties = census_df[census_df['SUMLEV'] == 50].groupby('STNAME')['COUNTY'].nunique()
with the following result:
STNAME
Alabama 67
Alaska 29
Arizona 15
Arkansas 75
California 58
Colorado 64
Connecticut 8
Delaware 3
District of Columbia 1
Florida 67
Georgia 159
Hawaii 5
Idaho 44
Illinois 102
Indiana 92
Iowa 99
Kansas 105
Kentucky 120
Louisiana 64
Maine 16
Maryland 24
Massachusetts 14
Michigan 83
Minnesota 87
Mississippi 82
Missouri 115
Montana 56
Nebraska 93
Nevada 17
New Hampshire 10
New Jersey 21
New Mexico 33
New York 62
North Carolina 100
North Dakota 53
Ohio 88
Oklahoma 77
Oregon 36
Pennsylvania 67
Rhode Island 5
South Carolina 46
South Dakota 66
Tennessee 95
Texas 254
Utah 29
Vermont 14
Virginia 133
Washington 39
West Virginia 55
Wisconsin 72
Wyoming 23
Name: COUNTY, dtype: int64
#Bakhtawar - This is a very simple way:
df.groupby(df['STNAME']).count().COUNTY

Categories

Resources