How to iterate over values in a loop and pass them to a function in Python?

I want to pass values to a function one by one in a loop. The values are stored in a dataframe.
def eam(A, B):
    y = A + " " + B
    return y
Suppose I pass the country as A and the capital as B.
Dataframe df is
country capital
India New Delhi
Indonesia Jakarta
Islamic Republic of Iran Tehran
Iraq Baghdad
Ireland Dublin
How can I get each value using a loop, like this:
0 India New Delhi
1 Indonesia Jakarta
2 Islamic Republic of Iran Tehran
3 Iraq Baghdad
4 Ireland Dublin

Here you go, just use the following syntax to get a new column in the dataframe. There is no need to write code to loop over the rows. However, if you must loop, df.iterrows() or df.itertuples() provide nice functionality to accomplish similar objectives.
>>> df = pd.read_clipboard(sep='\t')
>>> df.head()
country capital
0 India New Delhi
1 Indonesia Jakarta
2 Islamic Republic of Iran Tehran
3 Iraq Baghdad
4 Ireland Dublin
>>> df.columns
Index(['country', 'capital'], dtype='object')
>>> df['both'] = df['country'] + " " + df['capital']
>>> df.head()
country capital both
0 India New Delhi India New Delhi
1 Indonesia Jakarta Indonesia Jakarta
2 Islamic Republic of Iran Tehran Islamic Republic of Iran Tehran
3 Iraq Baghdad Iraq Baghdad
4 Ireland Dublin Ireland Dublin
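If you do need an explicit loop, a minimal sketch (assuming the same df and eam as above) could look like this:

```python
import pandas as pd

df = pd.DataFrame({'country': ['India', 'Indonesia'],
                   'capital': ['New Delhi', 'Jakarta']})

def eam(A, B):
    return A + " " + B

# itertuples is usually faster than iterrows and yields one namedtuple per row
results = [eam(row.country, row.capital) for row in df.itertuples(index=False)]
```

Each namedtuple exposes the columns as attributes, so row.country and row.capital map directly onto the function's A and B.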

Related

Validate a dataframe based on another dataframe?

I have two dataframes:
Table1:
Table2:
How to find:
The country-city combinations that are present only in Table2 but not Table1.
Here [India-Mumbai] is the output.
For each country-city combination that's present in both tables, find the "Initiatives" that are present in Table2 but not Table1.
Here {"India-Bangalore": [Textile, Irrigation], "USA-Texas": [Irrigation]}
To answer the first question, we can use the merge method and keep only the NaN rows:
>>> df_merged = pd.merge(df_1, df_2, on=['Country', 'City'], how='left', suffixes = ['_1', '_2'])
>>> df_merged[df_merged['Initiative_2'].isnull()][['Country', 'City']]
Country City
13 India Mumbai
For the next question, we first need to remove the NaN rows from the previously merged DataFrame:
>>> df_both_table = df_merged[~df_merged['Initiative_2'].isnull()]
>>> df_both_table
Country City Initiative_1 Initiative_2
0 India Bangalore Plants Plants
1 India Bangalore Plants Textile
2 India Bangalore Plants Irrigation
3 India Bangalore Industries Plants
4 India Bangalore Industries Textile
5 India Bangalore Industries Irrigation
6 India Bangalore Roads Plants
7 India Bangalore Roads Textile
8 India Bangalore Roads Irrigation
9 USA Texas Plants Plants
10 USA Texas Plants Irrigation
11 USA Texas Roads Plants
12 USA Texas Roads Irrigation
Then, we can filter the rows where Initiative_1 and Initiative_2 differ, and use a groupby to get the list of Initiative_2 values:
>>> df_unique_initiative_2 = df_both_table[~(df_both_table['Initiative_1'] == df_both_table['Initiative_2'])]
>>> df_list_initiative_2 = df_unique_initiative_2.groupby(['Country', 'City'])['Initiative_2'].unique().reset_index()
>>> df_list_initiative_2
Country City Initiative_2
0 India Bangalore [Textile, Irrigation, Plants]
1 USA Texas [Irrigation, Plants]
We do the same, but this time on Initiative_1, to get that list as well:
>>> df_list_initiative_1 = df_unique_initiative_2.groupby(['Country', 'City'])['Initiative_1'].unique().reset_index()
>>> df_list_initiative_1
Country City Initiative_1
0 India Bangalore [Plants, Industries, Roads]
1 USA Texas [Plants, Roads]
To finish, we use set difference to remove the redundant Initiative_1 elements and get the expected result:
>>> df_list_initiative_2['Initiative'] = (df_list_initiative_2['Initiative_2'].map(set)-df_list_initiative_1['Initiative_1'].map(set)).map(list)
>>> df_list_initiative_2[['Country', 'City', 'Initiative']]
Country City Initiative
0 India Bangalore [Textile, Irrigation]
1 USA Texas [Irrigation]
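As an aside, the first question can also be answered without inspecting NaN columns, using merge's indicator parameter; a sketch where the small frames below are assumptions standing in for Table1 and Table2:

```python
import pandas as pd

# hypothetical stand-ins for Table1 and Table2
df_1 = pd.DataFrame({'Country': ['India', 'USA'],
                     'City': ['Bangalore', 'Texas'],
                     'Initiative': ['Plants', 'Plants']})
df_2 = pd.DataFrame({'Country': ['India', 'India', 'USA'],
                     'City': ['Bangalore', 'Mumbai', 'Texas'],
                     'Initiative': ['Textile', 'Water', 'Irrigation']})

# indicator=True adds a _merge column saying which frame each row came from
only_in_2 = (pd.merge(df_2[['Country', 'City']].drop_duplicates(),
                      df_1[['Country', 'City']].drop_duplicates(),
                      on=['Country', 'City'], how='left', indicator=True)
             .query("_merge == 'left_only'")[['Country', 'City']])
```

Filtering on `_merge == 'left_only'` keeps exactly the combinations present only in the second frame.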
Alternative approach (df1 your Table1, df2 your Table2):
combos_1, combos_2 = set(zip(df1.Country, df1.City)), set(zip(df2.Country, df2.City))
in_2_but_not_in_1 = [f"{country}-{city}" for country, city in combos_2 - combos_1]
initiatives = {
    f"{country}-{city}": (
        set(df2.Initiative[df2.Country.eq(country) & df2.City.eq(city)])
        - set(df1.Initiative[df1.Country.eq(country) & df1.City.eq(city)])
    )
    for country, city in combos_1 & combos_2
}
Results:
['India-Delhi']
{'India-Bangalore': {'Irrigation', 'Textile'}, 'USA-Texas': {'Irrigation'}}
I think you got this part wrong: "The country-city combinations that are present only in Table2 but not Table1. Here [India-Mumbai] is the output." The combination India-Mumbai is not present in Table2, is it?

Groupby or Transpose?

I got this data
country report_date market_cap_usd
0 Australia 6/3/2020 90758154576
1 Australia 6/4/2020 91897977251
2 Australia 6/5/2020 94558861975
3 Canada 6/3/2020 42899754234
4 Canada 6/4/2020 43597908706
5 Canada 6/5/2020 45287016456
6 United States of America 6/3/2020 1.16679E+12
7 United States of America 6/4/2020 1.15709E+12
8 United States of America 6/5/2020 1.19652E+12
and want to turn it into:
report_date Australia Canada ....
6/3/2020 90758154576 42899754234 ...
6/4/2020 91897977251 43597908706 ...
How can I do this?
Use pivot_table:
# minimal example
import pandas
data = pandas.DataFrame({'country': ['Australia', 'Australia', 'Canada', 'Canada'],
                         'report_date': ['6/3/2020', '6/4/2020', '6/3/2020', '6/4/2020'],
                         'market_cap_usd': [923740927, 92797294, 20387334, 392738092]})
# pivot the table
data = data.pivot_table(index='report_date', columns='country')
# drop the multi-index column level
data.columns = [col[1] for col in data.columns]
Output:
             Australia     Canada
report_date
6/3/2020     923740927   20387334
6/4/2020      92797294  392738092
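Since each (report_date, country) pair occurs only once here, plain pivot should also work; a small sketch under that assumption, using the question's figures:

```python
import pandas as pd

data = pd.DataFrame({'country': ['Australia', 'Australia', 'Canada', 'Canada'],
                     'report_date': ['6/3/2020', '6/4/2020', '6/3/2020', '6/4/2020'],
                     'market_cap_usd': [90758154576, 91897977251, 42899754234, 43597908706]})

# pivot requires unique index/column pairs; values= avoids the extra column level
wide = data.pivot(index='report_date', columns='country', values='market_cap_usd')
wide = wide.reset_index().rename_axis(None, axis=1)  # back to plain columns
```

Unlike pivot_table, pivot performs no aggregation and raises an error on duplicate pairs, which makes accidental data loss easier to spot.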

How to maintain the same index after sorting a Pandas series?

I have the following Pandas series from the dataframe 'Reducedset':
Reducedset = Top15.iloc[:,10:20].mean(axis=1).sort_values(ascending=False)
Which gives me:
Country
United States 1.536434e+13
China 6.348609e+12
Japan 5.542208e+12
Germany 3.493025e+12
France 2.681725e+12
United Kingdom 2.487907e+12
Brazil 2.189794e+12
Italy 2.120175e+12
India 1.769297e+12
Canada 1.660647e+12
Russian Federation 1.565459e+12
Spain 1.418078e+12
Australia 1.164043e+12
South Korea 1.106715e+12
Iran 4.441558e+11
dtype: float64
I want to update the index of the full dataframe so that its rows are in the same order as the series above.
How can I do this?
In other words, when I then look at the entire dataframe, the index order should be the same as in the series above and not like that below:
Reducedset
Rank Documents Citable documents Citations \
Country
China 1 127050 126767 597237
United States 2 96661 94747 792274
Japan 3 30504 30287 223024
United Kingdom 4 20944 20357 206091
Russian Federation 5 18534 18301 34266
Canada 6 17899 17620 215003
Germany 7 17027 16831 140566
India 8 15005 14841 128763
France 9 13153 12973 130632
South Korea 10 11983 11923 114675
Italy 11 10964 10794 111850
Spain 12 9428 9330 123336
Iran 13 8896 8819 57470
Australia 14 8831 8725 90765
Brazil 15 8668 8596 60702
The answer:
Reducedset = Top15.iloc[:,10:20].mean(axis=1).sort_values(ascending=False)
This first stage takes the mean of columns 10-19 for each row (axis=1) and sorts the result in descending order (ascending=False).
Top15.reindex(Reducedset.index)
Here, we reindex the dataframe Top15 using the index of the sorted series, so its rows appear in the same order.
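A runnable sketch of the idea, using a hypothetical three-row stand-in for Top15:

```python
import pandas as pd

# hypothetical stand-in for Top15, indexed by country
Top15 = pd.DataFrame({'Rank': [1, 2, 3],
                      'GDP': [6.3e12, 1.5e13, 5.5e12]},
                     index=['China', 'United States', 'Japan'])

Reducedset = Top15['GDP'].sort_values(ascending=False)  # series, largest first
Top15_sorted = Top15.reindex(Reducedset.index)          # rows follow that order
```

reindex simply reorders the existing rows here, because the sorted series' index contains exactly the same labels as the frame.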

How to remove words in pandas data frame column which match with words in another column

I am trying to remove the parts of a string in a pandas dataframe column that are present (matched) in another column. These values are separated by commas, and there can be one or more of them. I want to create a new column with the remaining part of the string. Below is a reproducible example and my code so far:
import pandas as pd
df = pd.DataFrame({
    'Country': ['Germany, France, Brazil, India, Russia',
                'Russia, France, Jamaica, India, China',
                'Germany, Russia, Jamaica',
                'Italy, Jamaica'],
    'Exclude': ['France, Brazil', 'India, Russia', 'Jamaica', 'Italy']})
print(df)
Printed data frame:
Country Exclude
0 Germany, France, Brazil, India, Russia France, Brazil
1 Russia, France, Jamaica, India, China India, Russia
2 Germany, Russia, Jamaica Jamaica
3 Italy, Jamaica Italy
I want to create a column "Output" that will contain the names of the countries that are not present in column "Exclude". So I tried:
df['Output'] = df['Country'].replace(to_replace=r'\b' + df['Exclude'] + r'\b',
                                     value='', regex=True)
Desired Output:
Country Exclude Output
0 Germany, France, Brazil, India, Russia France, Brazil Germany, India, Russia
1 Russia, France, Jamaica, India, China India, Russia France, Jamaica, China
2 Germany, Russia, Jamaica Jamaica Germany, Russia
3 Italy, Jamaica Italy Jamaica
This does only half the job: it matches when the text in "Exclude" appears in "Country" in exactly the same order, but it doesn't work when the sequence differs from "Exclude". For example, it fails on the second row.
I spent a lot of time and tried a few other approaches before posting the question. I found similar questions on SO, but they do not help in this case.
Please help.
Use the set difference of the split values per row with apply:
f=lambda x: ', '.join(set(x['Country'].split(', ')).difference(set(x['Exclude'].split(', '))))
df['Out'] = df.apply(f, axis=1)
Or list comprehension with zip:
df['Out'] = [', '.join(set(a.split(', ')).difference(set(b.split(', '))))
             for a, b in zip(df['Country'], df['Exclude'])]
print (df)
Country Exclude \
0 Germany, France, Brazil, India, Russia France, Brazil
1 Russia, France, Jamaica, India, China India, Russia
2 Germany, Russia, Jamaica Jamaica
3 Italy, Jamaica Italy
Out
0 Germany, India, Russia
1 China, France, Jamaica
2 Germany, Russia
3 Jamaica
If order is important:
df['Out'] = [', '.join(x for x in a.split(', ') if x not in set(b.split(', ')))
for a, b in zip(df['Country'], df['Exclude'])]
print (df)
Country Exclude \
0 Germany, France, Brazil, India, Russia France, Brazil
1 Russia, France, Jamaica, India, China India, Russia
2 Germany, Russia, Jamaica Jamaica
3 Italy, Jamaica Italy
Out
0 Germany, India, Russia
1 France, Jamaica, China
2 Germany, Russia
3 Jamaica
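For completeness, the regex idea from the question can also be made to work by building one pattern per row; a sketch using the question's data (drop_excluded is a hypothetical helper name):

```python
import re
import pandas as pd

df = pd.DataFrame({
    'Country': ['Germany, France, Brazil, India, Russia',
                'Russia, France, Jamaica, India, China',
                'Germany, Russia, Jamaica',
                'Italy, Jamaica'],
    'Exclude': ['France, Brazil', 'India, Russia', 'Jamaica', 'Italy']})

def drop_excluded(row):
    # one alternation per row, e.g. r'\b(?:India|Russia)\b'
    pattern = r'\b(?:' + '|'.join(map(re.escape, row['Exclude'].split(', '))) + r')\b'
    cleaned = re.sub(pattern, '', row['Country'])
    # tidy the leftover commas and spaces
    return ', '.join(p for p in (s.strip() for s in cleaned.split(',')) if p)

df['Output'] = df.apply(drop_excluded, axis=1)
```

Unlike the single-pattern attempt, the alternation matches each excluded country independently, so the order of names no longer matters.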

Create stacked pandas series from series with list elements

I have a pandas series with elements as list:
import pandas as pd
s = pd.Series([ ['United States of America'],['China', 'Hong Kong'], []])
print(s)
0 [United States of America]
1 [China, Hong Kong]
2 []
How to get a series like the following:
0 United States of America
1 China
1 Hong Kong
I am not sure about what happens to 2.
The following options all return a Series. First, create a new DataFrame from the lists and stack it:
pd.DataFrame(s.tolist()).stack()
0 0 United States of America
1 0 China
1 Hong Kong
dtype: object
To reset the index, use
pd.DataFrame(s.tolist()).stack().reset_index(drop=True)
0 United States of America
1 China
2 Hong Kong
dtype: object
To convert to DataFrame, call to_frame()
pd.DataFrame(s.tolist()).stack().reset_index(drop=True).to_frame('countries')
countries
0 United States of America
1 China
2 Hong Kong
If you're trying to code golf, use
sum(s, [])
# ['United States of America', 'China', 'Hong Kong']
pd.Series(sum(s, []))
0 United States of America
1 China
2 Hong Kong
dtype: object
Or even,
pd.Series(np.sum(s))
0 United States of America
1 China
2 Hong Kong
dtype: object
However, like most operations involving repeated list concatenation, this performs poorly (each concatenation copies the accumulated list).
Faster operations are possible with itertools.chain:
from itertools import chain
pd.Series(list(chain.from_iterable(s)))
0 United States of America
1 China
2 Hong Kong
dtype: object
pd.DataFrame(list(chain.from_iterable(s)), columns=['countries'])
countries
0 United States of America
1 China
2 Hong Kong
Or use (note that this relies on the exact shape of this example):
df = pd.DataFrame(s.tolist())
print(df[0].fillna(df[1].dropna().item()))
Output:
0 United States of America
1 China
2 Hong Kong
Name: 0, dtype: object
Assuming each element is a list:
pd.Series(s.sum())
Out[103]:
0 United States of America
1 China
2 Hong Kong
dtype: object
There is a simpler and probably far less computationally expensive way to do this with the pandas method explode (available since pandas 0.25). In your case, the answer would be:
s.explode()
Simple as that! For a DataFrame with more than one column, you specify which column to "explode" by passing its name, for example df.explode('country').
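A minimal runnable version, including handling of the empty list (which explode turns into NaN); this assumes pandas >= 0.25:

```python
import pandas as pd

s = pd.Series([['United States of America'], ['China', 'Hong Kong'], []])

exploded = s.explode()  # index repeats for multi-element lists; [] becomes NaN
result = exploded.dropna().reset_index(drop=True)
```

This also answers the question of what happens to element 2: it survives as a single NaN row, which dropna then removes.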
