creating a DataFrame in pandas using a List of lists

creating a DataFrame in pandas using a List of lists - python

I have a list of lists like this.
sports = [['Sport', 'Country(s)'], ['Foot_ball', 'brazil'], ['Volleyball', 'Argentina', 'India'], ['Rugger', 'New_zealand', ‘South_africa’], ['Cricket', 'India'], ['Carrom', 'Uk', ‘Usa’], ['Chess', 'Uk']]
I want to create panda data frame using the above lists as follows:
sport Country(s)
Foot_ball brazil
Volleyball Argentina
Volleyball india
Rugger New_zealnd
Rugger South_africa
Criket India
Carrom UK
Carrom Usa
Chess UK
I was trying like this
sport_x = []
for x in sports[1:]:
sport_x.append(x[0])
print(sport_x)
country = []
for y in sports[1:]:
country.append(y[1:])
header = sports[0]
df = pd.DataFrame([sport_x,country], columns = header)
halfway through, i m getting this error
But i was getting this error.
AssertionError: 2 columns passed, passed data had 6 columns
Any suggestions, how to do this.

Something like this to first "expand" the irregularly shaped rows, then dataframefy them.
>>> sports = [
["Sport", "Country(s)"],
["Foot_ball", "brazil"],
["Volleyball", "Argentina", "India"],
["Rugger", "New_zealand", "South_africa"],
["Cricket", "India"],
["Carrom", "Uk", "Usa"],
["Chess", "Uk"],
]
>>> expanded_sports = []
>>> for row in sports:
... for country in row[1:]:
... expanded_sports.append((row[0], country))
...
>>> pd.DataFrame(expanded_sports[1:], columns=expanded_sports[0])
Sport Country(s)
0 Foot_ball brazil
1 Volleyball Argentina
2 Volleyball India
3 Rugger New_zealand
4 Rugger South_africa
5 Cricket India
6 Carrom Uk
7 Carrom Usa
8 Chess Uk
>>>
EDIT: Another solution using .melt(), but this looks uglier to me, and the order isn't the same.
>>> pd.DataFrame(sports[1:]).melt(0, value_name='country').dropna().drop('variable', axis=1).rename({0: 'sport'}, axis=1)
sport country
0 Foot_ball brazil
1 Volleyball Argentina
2 Rugger New_zealand
3 Cricket India
4 Carrom Uk
5 Chess Uk
7 Volleyball India
8 Rugger South_africa
10 Carrom Usa

Or, pandas way using explode and list comprehension:
df=pd.DataFrame([[i[0],','.join(i[1:])] if len(i)>2 else i for i in sports[1:]],
columns=sports[0])
df['Country(s)']=df['Country(s)'].str.split(',')
final=df.explode('Country(s)').reset_index(drop=True)
Sport Country(s)
0 Foot_ball brazil
1 Volleyball Argentina
2 Volleyball India
3 Rugger New_zealand
4 Rugger South_africa
5 Cricket India
6 Carrom Uk
7 Carrom Usa
8 Chess Uk

Related

Validate a dataframe based on another dataframe?

I have two dataframes:
Table1:
Table2:
How to find:
The country-city combinations that are present only in Table2 but not Table1.
Here [India-Mumbai] is the output.
For each country-city combination, that's present in both the tables, find the "Initiatives" that are present in Table2 but not Table1.
Here {"India-Bangalore": [Textile, Irrigation], "USA-Texas": [Irrigation]}

To answer the first question, we can use the merge method and keep only the NaN rows :
>>> df_merged = pd.merge(df_1, df_2, on=['Country', 'City'], how='left', suffixes = ['_1', '_2'])
>>> df_merged[df_merged['Initiative_2'].isnull()][['Country', 'City']]
Country City
13 India Mumbai
For the next question, we first need to remove the NaN rows from the previously merged DataFrame :
>>> df_both_table = df_merged[~df_merged['Initiative_2'].isnull()]
>>> df_both_table
Country City Initiative_1 Initiative_2
0 India Bangalore Plants Plants
1 India Bangalore Plants Textile
2 India Bangalore Plants Irrigtion
3 India Bangalore Industries Plants
4 India Bangalore Industries Textile
5 India Bangalore Industries Irrigtion
6 India Bangalore Roads Plants
7 India Bangalore Roads Textile
8 India Bangalore Roads Irrigtion
9 USA Texas Plants Plants
10 USA Texas Plants Irrigation
11 USA Texas Roads Plants
12 USA Texas Roads Irrigation
Then, we can filter on the rows that are strictly different on columns Initiative_1 and Initiative_2 and use a groupby to get the list of Innitiative_2 :
>>> df_unique_initiative_2 = df_both_table[~(df_both_table['Initiative_1'] == df_both_table['Initiative_2'])]
>>> df_list_initiative_2 = df_unique_initiative_2.groupby(['Country', 'City'])['Initiative_2'].unique().reset_index()
>>> df_list_initiative_2
Country City Initiative_2
0 India Bangalore [Textile, Irrigation, Plants]
1 USA Texas [Irrigation, Plants]
We do the same but this time on Initiative_1 to get the list as well :
>>> df_list_initiative_1 = df_unique_initiative_2.groupby(['Country', 'City'])['Initiative_1'].unique().reset_index()
>>> df_list_initiative_1
Country City Initiative_1
0 India Bangalore [Plants, Industries, Roads]
1 USA Texas [Plants, Roads]
To finish, we use the set to remove the last redondant Initiative_1 elements to get the expected result :
>>> df_list_initiative_2['Initiative'] = (df_list_initiative_2['Initiative_2'].map(set)-df_list_initiative_1['Initiative_1'].map(set)).map(list)
>>> df_list_initiative_2[['Country', 'City', 'Initiative']]
Country City Initiative
0 India Bangalore [Textile, Irrigation]
1 USA Texas [Irrigation]

Alternative approach (df1 your Table1, df2 your Table2):
combos_1, combos_2 = set(zip(df1.Country, df1.City)), set(zip(df2.Country, df2.City))
in_2_but_not_in_1 = [f"{country}-{city}" for country, city in combos_2 - combos_1]
initiatives = {
f"{country}-{city}": (
set(df2.Initiative[df2.Country.eq(country) & df2.City.eq(city)])
- set(df1.Initiative[df1.Country.eq(country) & df1.City.eq(city)])
)
for country, city in combos_1 & combos_2
}
Results:
['India-Delhi']
{'India-Bangalore': {'Irrigation', 'Textile'}, 'USA-Texas': {'Irrigation'}}
I think you got this "The country-city combinations that are present only in Table2 but not Table1. Here [India-Mumbai] is the output" wrong: The combinations India-Mumbai is not present in Table2?

How to maintain the same index after sorting a Pandas series?

I have the following Pandas series from the dataframe 'Reducedset':
Reducedset = Top15.iloc[:,10:20].mean(axis=1).sort_values(ascending=False)
Which gives me:
Country
United States 1.536434e+13
China 6.348609e+12
Japan 5.542208e+12
Germany 3.493025e+12
France 2.681725e+12
United Kingdom 2.487907e+12
Brazil 2.189794e+12
Italy 2.120175e+12
India 1.769297e+12
Canada 1.660647e+12
Russian Federation 1.565459e+12
Spain 1.418078e+12
Australia 1.164043e+12
South Korea 1.106715e+12
Iran 4.441558e+11
dtype: float64
I want to update the index, so that index of the dataframe Reducedset is in the same order as the series above.
How can I do this?
In other words, when I then look at the entire dataframe, the index order should be the same as in the series above and not like that below:
Reducedset
Rank Documents Citable documents Citations \
Country
China 1 127050 126767 597237
United States 2 96661 94747 792274
Japan 3 30504 30287 223024
United Kingdom 4 20944 20357 206091
Russian Federation 5 18534 18301 34266
Canada 6 17899 17620 215003
Germany 7 17027 16831 140566
India 8 15005 14841 128763
France 9 13153 12973 130632
South Korea 10 11983 11923 114675
Italy 11 10964 10794 111850
Spain 12 9428 9330 123336
Iran 13 8896 8819 57470
Australia 14 8831 8725 90765
Brazil 15 8668 8596 60702

The answer:
Reducedset = Top15.iloc[:,10:20].mean(axis=1).sort_values(ascending=False)
This first stage finds the mean of columns 10-20 for each row (axis=1) and sorts them in descending order (ascending = False)
Reducedset.reindex(Reducedset.index)
Here, we are resetting the index of the dataframe 'Reducedset' as the index of the amended dataframe above.

Trying to match up two df's based on three common columns with none of them being identical

I have two df's
df1
date League teams
0 201902272215 brazil cup foz do iguacu fcceara ce
1 201902272300 colombia primera a deportes tolimaatletico bucaramanga
2 201902272300 brazil campeonato gaucho 2nd division ypiranga rsuniao frederiquense
3 201902272300 brazil campeonato gaucho 2nd division esportivo rstupi rs
4 201902272300 brazil campeonato gaucho 2nd division sao paulo rsgremio esportivo bage
14 201902280000 four nations women tournament (in usa) usa (w)japan (w)
25 201902280030 bolivia professional football league real potosibolivar
df2
date league teams
0 201902280000 womens international usa womenjapan women
1 201902280000 brazil amazonense sul america ecrio negro am
2 201902280030 bolivia apertura real potosibolivar
3 201902280030 brazil campeonato paulista palmeirasituano
4 201902280030 copa sudamericana racing clubcorinthians
The result I would want is all the rows from df2 that are near matches with df1
date league teams near_match
0 201902280000 womens international usa womenjapan women 1
1 201902280000 brazil amazonense sul america ecrio negro am 0
2 201902280030 bolivia apertura real potosibolivar 1
3 201902280030 brazil campeonato paulista palmeirasituano 0
4 201902280030 copa sudamericana racing clubcorinthians 0
I have tried to use a variation of a for loop using SequenceMatcher and setting the threshold to a match of above 0.8, but haven't had any luck.
df_1['merge_teams'] = df_1['teams'] # we will use these as the merge keys
df_1['merge_date'] = df_1['date']
# df_1['merge_league'] = df_1['league']
for teams_1, date_1, league_1 in df_1[['teams','date']].values:
for ixb, (teams_1, teams_2) in enumerate(df_2[['teams','date']].values):
if difflib.SequenceMatcher(None,teams_1,teams_2).ratio() > .8:
df_2.ix[ixb,'merge_teams'] = teams_1 # creates a merge key in df_2
if difflib.SequenceMatcher(None,date_1, date_2).ratio() > .8:
df_2.ix[ixb,'merge_date'] = date_1 # creates a merge key in df_2
# This should rturn all rows where teams,date and league all match by over 80%
# This is just for teams and date, I want to include league as well
Any advice or guidance would be greatly appreciated.

How to add a dictionary as the last element to a list of dictionaries?

I would like to add a dictionary to a list, which contains several other dictionaries.
I have a list of ten top travel cities:
City Country Population Area
0 Buenos Aires Argentina 2891000 4758
1 Toronto Canada 2800000 2731571
2 Pyeongchang South Korea 2581000 3194
3 Marakesh Morocco 928850 200
4 Albuquerque New Mexico 559277 491
5 Los Cabos Mexico 287651 3750
6 Greenville USA 84554 68
7 Archipelago Sea Finland 60000 8300
8 Walla Walla Valley USA 32237 33
9 Salina Island Italy 4000 27
10 Solta Croatia 1700 59
11 Iguazu Falls Argentina 0 672
I imported the excel with pandas:
import pandas as pd
travel_df = pd.read_excel('./cities.xlsx')
print(travel_df)
cities = travel_df.to_dict('records')
print(cities)
variables = list(cities[0].keys())
I would like to add a 12th element to the end of the list but don't know how to do so:
beijing = {"City" : "Beijing", "Country" : "China", "Population" : "24000000", "Ares" : "6490" }
print(beijing)

Try appending the new row to the DataFrame you read.
travel_df.append(beijing, ignore_index=True)

how to iterate by loop with values in function using python?

I want to pass values using loop one by one in function using python.Values are stored in dataframe.
def eam(A,B):
y=A +" " +B
return y
Suppose I pass the values of A as country and B as capital .
Dataframe df is
country capital
India New Delhi
Indonesia Jakarta
Islamic Republic of Iran Tehran
Iraq Baghdad
Ireland Dublin
How can I get value using loop
0 India New Delhi
1 Indonesia Jakarta
2 Islamic Republic of Iran Tehran
3 Iraq Baghdad
4 Ireland Dublin

Here you go, just use the following syntax to get a new column in the dataframe. No need to write code to loop over the rows. However, if you must loop, df.iterrows() returns or df.itertuples() provide nice functionality to accomplish similar objectives.
>>> df = pd.read_clipboard(sep='\t')
>>> df.head()
country capital
0 India New Delhi
1 Indonesia Jakarta
2 Islamic Republic of Iran Tehran
3 Iraq Baghdad
4 Ireland Dublin
>>> df.columns
Index(['country', 'capital'], dtype='object')
>>> df['both'] = df['country'] + " " + df['capital']
>>> df.head()
country capital both
0 India New Delhi India New Delhi
1 Indonesia Jakarta Indonesia Jakarta
2 Islamic Republic of Iran Tehran Islamic Republic of Iran Tehran
3 Iraq Baghdad Iraq Baghdad
4 Ireland Dublin Ireland Dublin

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

creating a DataFrame in pandas using a List of lists - python

Related

Validate a dataframe based on another dataframe?

How to maintain the same index after sorting a Pandas series?

Trying to match up two df's based on three common columns with none of them being identical

How to add a dictionary as the last element to a list of dictionaries?

how to iterate by loop with values in function using python?

Categories

Resources