Below is the dataframe
df = pd.DataFrame({'Cust_Pincode':[487551,487551,639207,452001,484661,484661],
'REGIONAL_GROUPING':['WEST I','WEST II','TN II','WEST I','WEST I','WEST II'],
'C_LATITUDE':[22.89831,23.74881,10.72208,22.69875,23.88280,23.88280],
'C_LONGITUDE':[78.75441,79.48472,77.94168,75.88575,80.98250,80.98250],
'Region_dist_lim':[33.577743,33.577743,36.812093,33.577743,33.577743,33.577743]})
Cust_Pincode REGIONAL_GROUPING C_LATITUDE C_LONGITUDE Region_dist_lim
0 487551 WEST I 22.89831 78.75441 33.577743
1 487551 WEST II 23.74881 79.48472 33.577743
2 639207 TN II 10.72208 77.94168 36.812093
3 452001 WEST I 22.69875 75.88575 33.577743
4 484661 WEST I 23.88280 80.98250 33.577743
5 484661 WEST II 23.88280 80.98250 33.577743
I'm trying to write code that returns the unique Cust_Pincode values that have more than one REGIONAL_GROUPING: group by Cust_Pincode and REGIONAL_GROUPING and return the rows where a Cust_Pincode has multiple regional grouping values. Below is the expected output dataframe:
   Cust_Pincode REGIONAL_GROUPING
0        487551 WEST I
                WEST II
1        484661 WEST I
                WEST II
The code I've written is below:
df.groupby(['Cust_Pincode','REGIONAL_GROUPING']).filter(lambda x: len(x) > 1)
The above code is not giving any output
You can try this solution
df = df.groupby(['Cust_Pincode']).filter(lambda x: len(x) > 1)
print(df.groupby(['Cust_Pincode', 'REGIONAL_GROUPING']).first())
Why use filter()?
You can just use first() like this:
df.groupby(['Cust_Pincode','REGIONAL_GROUPING']).first()
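If the goal is specifically pincodes with more than one distinct grouping, here is a sketch of an alternative (using the df defined above): count the distinct groupings per pincode and keep the unique combinations.
# keep pincodes whose REGIONAL_GROUPING has more than one distinct value
multi = df.groupby('Cust_Pincode')['REGIONAL_GROUPING'].transform('nunique') > 1
out = df.loc[multi, ['Cust_Pincode', 'REGIONAL_GROUPING']].drop_duplicates()
print(out)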
Given the following data ...
city country
0 London UK
1 Paris FR
2 Paris US
3 London UK
... I'd like a count of each city-country pair
city country n
0 London UK 2
1 Paris FR 1
2 Paris US 1
The following works but feels like a hack:
df = pd.DataFrame([('London', 'UK'), ('Paris', 'FR'), ('Paris', 'US'), ('London', 'UK')], columns=['city', 'country'])
df.assign(**{'n': 1}).groupby(['city', 'country']).count().reset_index()
I'm assigning an additional column n of all 1s, grouping on city and country, and then counting occurrences of this new all-1s column with count(). It works, but adding a column just to count it feels wrong.
Is there a cleaner solution?
There is a better way: use value_counts
df.value_counts().reset_index(name='n')
city country n
0 London UK 2
1 Paris FR 1
2 Paris US 1
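An equivalent spelling without value_counts, if you prefer an explicit groupby (a sketch against the same df; only the row order may differ):
df.groupby(['city', 'country']).size().reset_index(name='n')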
In SQL I can do something like JOIN ON CASE WHEN. Is there a way to do this in Pandas?
disease = [
{"City":"CH","Case_Recorded":5300,"Recovered":2839,"Deaths":2461},
{"City":"NY","Case_Recorded":1311,"Recovered":521,"Deaths":790},
{"City":"TX","Case_Recorded":1991,"Recovered":1413,"Deaths":578},
{"City":"AT","Case_Recorded":3381,"Recovered":3112,"Deaths":269},
{"City":"TX","Case_Recorded":3991,"Recovered":2810,"Deaths":1311},
{"City":"LA","Case_Recorded":2647,"Recovered":2344,"Deaths":303},
{"City":"LA","Case_Recorded":4410,"Recovered":3344,"Deaths":1066}
]
region = {"North": ["AT"], "West":["TX","LA"]}
So what I have are two dummy dicts that I have already converted to dataframes; the first holds the city names with their case counts, and I'm trying to figure out which region each city belongs to. The expected output is:
Region|City
North|AT
West|TX
West|LA
None|NY
None|CH
What I thought of doing in SQL was a LEFT JOIN with CASE WHEN: if the join against the North region comes back null, then join against the West region.
But if there are 15 or 30 regions in a country, I think that would become a problem.
Use:
#get City without duplicates
df1 = pd.DataFrame(disease)[['City']].drop_duplicates()
#create DataFrame from region dictionary
region = {"North": ["AT"], "West":["TX","LA"]}
df2 = pd.DataFrame([(k, x) for k, v in region.items() for x in v],
columns=['Region','City'])
#append not matched cities to df2
out = pd.concat([df2, df1[~df1['City'].isin(df2['City'])]])
print (out)
Region City
0 North AT
1 West TX
2 West LA
0 NaN CH
1 NaN NY
If order is not important:
out = df2.merge(df1, how = 'right')
print (out)
Region City
0 NaN CH
1 NaN NY
2 West TX
3 North AT
4 West LA
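If the lookup stays a plain dict, another option (a sketch, not part of the answer above) is to invert it into a city-to-region mapping and use Series.map:
# build {'AT': 'North', 'TX': 'West', 'LA': 'West'} from the region dict
city_to_region = {city: reg for reg, cities in region.items() for city in cities}
out = df1.assign(Region=df1['City'].map(city_to_region))[['Region', 'City']]
print(out)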
I'm sorry, I'm not exactly sure what your expected result is; could you elaborate? If your expected result is just getting each city's region, there is no need for conditional joining. For example, you can transform the city-region table into one row per city and region and join it directly with the main df:
disease = [
{"City":"CH","Case_Recorded":5300,"Recovered":2839,"Deaths":2461},
{"City":"NY","Case_Recorded":1311,"Recovered":521,"Deaths":790},
{"City":"TX","Case_Recorded":1991,"Recovered":1413,"Deaths":578},
{"City":"AT","Case_Recorded":3381,"Recovered":3112,"Deaths":269},
{"City":"TX","Case_Recorded":3991,"Recovered":2810,"Deaths":1311},
{"City":"LA","Case_Recorded":2647,"Recovered":2344,"Deaths":303},
{"City":"LA","Case_Recorded":4410,"Recovered":3344,"Deaths":1066}
]
region = [
{'City':'AT','Region':"North"},
{'City':'TX','Region':"West"},
{'City':'LA','Region':"West"}
]
df = pd.DataFrame(disease)
df_reg = pd.DataFrame(region)
df.merge( df_reg , on = 'City' , how = 'left' )
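If you then want just the Region/City pairs from the expected output, one way (assuming duplicates per city should collapse to a single row) is to deduplicate the merged frame:
merged = df.merge(df_reg, on='City', how='left')
print(merged[['Region', 'City']].drop_duplicates())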
I have a dataframe df with two columns as follows:
ID Country_pairs
0 X [(France, USA), (USA, France)]
1 Y [(USA, UK), (UK, France), (USA, France)]
I want to output all possible pairs of countries in two columns as follows:
ID Country1 Country2
X France USA
X USA France
Y USA UK
Y UK France
Y USA France
Doing this gives me the output I want:
result = pd.DataFrame()
for index, row in df.iterrows():
    x = row['Country_pairs']
    temp = pd.DataFrame(data=x, columns=['Country1', 'Country2'])
    temp['ID'] = row['ID']
    result = result.append(temp)
print(result)
The dataframe is over 200 million rows, so this is very slow since I'm looping. I was wondering if there is a faster solution?
import pandas as pd
# setup the dataframe
df = pd.DataFrame({'ID': ['X', 'Y'],
'Country_Pairs': [[('France', 'USA'), ('USA', 'France')],
[('USA', 'UK'), ('UK', 'France'), ('USA', 'France')]]})
ID Country_Pairs
0 X [(France, USA), (USA, France)]
1 Y [(USA, UK), (UK, France), (USA, France)]
# separate each tuple to its own row with explode
df2 = df.explode('Country_Pairs')
# separate each value in the tuple to its own column
df2[['Country1', 'Country2']] = pd.DataFrame(df2.Country_Pairs.tolist(), index=df2.index)
# delete Country_Pairs
df2.drop(columns=['Country_Pairs'], inplace=True)
ID Country1 Country2
0 X France USA
0 X USA France
1 Y USA UK
1 Y UK France
1 Y USA France
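Note that after the explode the original row index repeats (0, 0, 1, 1, 1 above); if you want a clean 0..n-1 index you can reset it:
df2 = df2.reset_index(drop=True)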
Give this a try:
explode_df = df.apply(pd.Series.explode)
split_country = explode_df.apply(lambda x: ' '.join(x['Country_pairs']), axis=1).str.split(expand=True)
# if you want to combine the results
res = pd.concat([explode_df, split_country], axis=1)
You are looking for .explode()
# one row per (Country1, Country2) tuple
result = df.explode('Country_pairs')
# unpack each tuple into its own column
result["Country1"] = result.Country_pairs.apply(lambda t: t[0])
result["Country2"] = result.Country_pairs.apply(lambda t: t[1])
# drop the original column of tuples
del result["Country_pairs"]
200 million rows is massive; there is no point running this computation in Pandas. As suggested in the comments, use Apache Spark, or if the data lives in a database you could possibly do the work there.
The solution I proffer works on small datasets suited to Pandas: take the data out of Pandas, use the itertools functions product and chain, and build the dataframe back up. It should be reasonably fast, but certainly not for 200 million rows.
# using the data provided by @Trenton
df = pd.DataFrame({'ID': ['X', 'Y'],
'Country_Pairs': [[('France', 'USA'), ('USA', 'France')],
[('USA', 'UK'), ('UK', 'France'), ('USA', 'France')]]})
from itertools import product, chain
# wrap the ID in a list so product pairs it as a single item (not character by character)
step1 = chain.from_iterable(product([first], last)
                            for first, last in
                            df.to_numpy())
res = pd.DataFrame(((first, *last) for first, last in step1),
columns=['ID', 'Country1', 'Country2'])
res
ID Country1 Country2
0 X France USA
1 X USA France
2 Y USA UK
3 Y UK France
4 Y USA France
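If you want to stay in plain Python without itertools, a list comprehension over the two columns is another sketch (same toy data as above; for genuinely huge data Spark or a database is still the better tool):
rows = [(pid, c1, c2)
        for pid, pairs in zip(df['ID'], df['Country_Pairs'])
        for c1, c2 in pairs]
res = pd.DataFrame(rows, columns=['ID', 'Country1', 'Country2'])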
So here is my issue: I have a dataframe df with a column "Info" like this:
0 US[edit]
1 Boston(B1)
2 Washington(W1)
3 Chicago(C1)
4 UK[edit]
5 London(L2)
6 Manchester(L2)
I would like to put all the strings containing "[edit]" into a separate column df['state'], and the remaining strings into another column df['city']. I also want to do some cleanup and remove the parts in [] and (). This is what I tried:
for ind, row in df.iterrows():
    if df['Info'].str.contains('[ed', regex=False):
        df['state'] = df['info'].str.split('\[|\(').str[0]
    else:
        df['city'] = df['info'].str.split('\[|\(').str[0]
At the end I would like to have something like this
US Boston
US Washington
US Chicago
UK London
UK Manchester
When I try this I always get "The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()"
Any help? Thanks!!
Use Series.where with forward-filling of missing values for the state column; for city, assign the Series s; then
filter with boolean indexing, using the mask inverted by ~:
m = df['Info'].str.contains('[ed', regex=False)
s = df['Info'].str.split(r'\[|\(').str[0]
df['state'] = s.where(m).ffill()
df['city'] = s
df = df[~m]
print (df)
Info state city
1 Boston(B1) US Boston
2 Washington(W1) US Washington
3 Chicago(C1) US Chicago
5 London(L2) UK London
6 Manchester(L2) UK Manchester
If you want you can also remove original column by adding DataFrame.pop:
m = df['Info'].str.contains('[ed', regex=False)
s = df.pop('Info').str.split(r'\[|\(').str[0]
df['state'] = s.where(m).ffill()
df['city'] = s
df = df[~m]
print (df)
state city
1 US Boston
2 US Washington
3 US Chicago
5 UK London
6 UK Manchester
I would do:
s = df.Info.str.extract(r'([\w\s]+)(\[edit\])?')
df['city'] = s[0]
df['state'] = s[0].mask(s[1].isna()).ffill()
df = df[s[1].isna()]
Output:
             Info        city state
1      Boston(B1)      Boston    US
2  Washington(W1)  Washington    US
3     Chicago(C1)     Chicago    US
5      London(L2)      London    UK
6  Manchester(L2)  Manchester    UK
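For reference, the error in the original attempt comes from putting a whole boolean Series (df['Info'].str.contains(...)) inside if. Inside iterrows you would test the row's own value instead; a minimal sketch that only fixes that error (the vectorized answers above remain the better approach, and this does not forward-fill the state per city):
for ind, row in df.iterrows():
    if '[ed' in row['Info']:
        df.loc[ind, 'state'] = row['Info'].split('[')[0]
    else:
        df.loc[ind, 'city'] = row['Info'].split('(')[0]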
I have a dataframe and I'm trying to create a new column of values that is one column divided by the other. This should be obvious but I'm only getting 0's and 1's as my output.
I also tried converting the output to float in case the output was somehow being rounded off but that didn't change anything.
def answer_seven():
    df = answer_one()
    columns_to_keep = ['Self-citations', 'Citations']
    df = df[columns_to_keep]
    df['ratio'] = df['Self-citations'] / df['Citations']
    return df

answer_seven()
Output:
Self_cite Citations ratio
Country
Aus. 15606 90765 0
Brazil 14396 60702 0
Canada 40930 215003 0
China 411683 597237 1
France 28601 130632 0
Germany 27426 140566 0
India 37209 128763 0
Iran 19125 57470 0
Italy 26661 111850 0
Japan 61554 223024 0
S Korea 22595 114675 0
Russian 12422 34266 0
Spain 23964 123336 0
Britain 37874 206091 0
America 265436 792274 0
Does anyone know why I'm only getting 1s and 0s when I want float values? I tried the solutions given in the suggested link and none of them worked. I've tried to convert the values to floats using a few different methods, including .astype('float'), float(df['A']) and df['ratio'] = df['Self-citations'] * 1.0 / df['Citations'], but none have worked so far.
Without having the exact dataframe it is difficult to say. But it is most likely a casting problem.
Let's build an MCVE:
import io
import pandas as pd
s = io.StringIO("""Country;Self_cite;Citations
Aus.;15606;90765
Brazil;14396;60702
Canada;40930;215003
China;411683;597237
France;28601;130632
Germany;27426;140566
India;37209;128763
Iran;19125;57470
Italy;26661;111850
Japan;61554;223024
S. Korea;22595;114675
Russian;12422;34266
Spain;23964;123336
Britain;37874;206091
America;265436;792274""")
df = pd.read_csv(s, sep=';', header=0).set_index('Country')
Then we can perform the desired operation as you suggested:
df['ratio'] = df['Self_cite']/df['Citations']
Checking dtypes:
df.dtypes
Self_cite int64
Citations int64
ratio float64
dtype: object
The result is:
Self_cite Citations ratio
Country
Aus. 15606 90765 0.171939
Brazil 14396 60702 0.237159
Canada 40930 215003 0.190369
China 411683 597237 0.689313
France 28601 130632 0.218943
Germany 27426 140566 0.195111
India 37209 128763 0.288973
Iran 19125 57470 0.332782
Italy 26661 111850 0.238364
Japan 61554 223024 0.275997
S. Korea 22595 114675 0.197035
Russian 12422 34266 0.362517
Spain 23964 123336 0.194299
Britain 37874 206091 0.183773
America 265436 792274 0.335031
Graphically:
df['ratio'].plot(kind='bar')
If you want to enforce the type, you can cast the dataframe using the astype method:
df.astype(float)
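For example, casting just the two count columns before recomputing the ratio (a sketch continuing the MCVE above):
df[['Self_cite', 'Citations']] = df[['Self_cite', 'Citations']].astype(float)
df['ratio'] = df['Self_cite'] / df['Citations']
print(df.dtypes)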