How to create multiple dataframes from an Excel data table - python

I have extracted this dataframe from an Excel spreadsheet using the pandas library. After selecting the columns I need, the table is formatted like this:
REF PLAYERS
0 103368 Andrés Posada Sanmiguel
1 300552 Diego Posada Sanmiguel
2 103304 Roberto Motta Stanziola
3 NaN NaN
4 REF PLAYERS
5 1047012 ANABELLA EISMANN DE AMAYA
6 104701 FERNANDO ENRIQUE AMAYA CASTRO
7 103451 AUGUSTO ANTONIO ALVARADO AZCARRAGA
8 103484 Kevin Adrian Villarreal Kam
9 REF PLAYERS
10 NaN NaN
11 NaN NaN
12 NaN NaN
13 NaN NaN
14 REF PLAYERS
15 NaN NaN
16 NaN NaN
17 NaN NaN
18 NaN NaN
19 REF PLAYERS
I want to create multiple dataframes, converting each ['REF', 'PLAYERS'] row into the column headers of a new dataframe.
Suggestions are welcome. I also need to preserve the blank rows. I'm a pandas newbie.

For this to work, you must first read the dataframe from the file differently: pass the argument header=None to your pd.read_excel() call. As it stands, your columns are already named "REF" and "PLAYERS", but we want those header rows inside the data so we can group by them.
With header=None the columns are labeled with integers, so the first column would be 0, and the first line of the solution becomes the following, where df is the name of your dataframe:
# Set unique index for each group
df["group_id"] = (df[0] == "REF").cumsum()
Solution:
# Set unique index for each group
df["group_id"] = (df["name_of_first_column"] == "REF").cumsum()
# Iterate over groups
dataframes = []
for name, group in df.groupby("group_id"):
    df_ = group
    # promote the 1st row to column names
    df_.columns = df_.iloc[0]
    # and drop it from the data
    df_ = df_.iloc[1:]
    # drop the group_id helper column
    df_ = df_[["REF", "PLAYERS"]]
    # append to the list of dataframes
    dataframes.append(df_)
All your dataframes are now stored in the list dataframes.
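Putting it together, a minimal end-to-end sketch (players.xlsx is a hypothetical filename standing in for your spreadsheet):
import pandas as pd

# read without headers so the repeated "REF PLAYERS" rows stay in the data
df = pd.read_excel("players.xlsx", header=None)  # hypothetical filename

# a new group starts every time column 0 holds "REF"
df["group_id"] = (df[0] == "REF").cumsum()

dataframes = []
for name, group in df.groupby("group_id"):
    df_ = group
    df_.columns = df_.iloc[0]      # promote the header row to column names
    df_ = df_.iloc[1:]             # drop it from the data
    df_ = df_[["REF", "PLAYERS"]]  # drop the group_id helper column
    dataframes.append(df_)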

You can split your dataframe into equal lengths (in your case 4 rows for each df) using np.split.
Since you want 4 rows per dataframe and the sample has 20 rows, you can split it into 5 different dataframes:
import numpy as np
dfs = [df.loc[idx] for idx in np.split(df.index, 5)]
And then create your individual dataframes:
df1 = dfs[1]
df1
REF PLAYERS
4 REF PLAYERS
5 1047012 ANABELLA EISMANN DE AMAYA
6 104701 FERNANDO ENRIQUE AMAYA CASTRO
7 103451 AUGUSTO ANTONIO ALVARADO AZCARRAGA
df2 = dfs[2]
df2
REF PLAYERS
8 103484 Kevin Adrian Villarreal Kam
9 REF PLAYERS
10 NaN NaN
11 NaN NaN
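Note that np.split raises an error if the row count doesn't divide evenly; np.array_split is the forgiving variant that lets the last chunk be shorter:
import numpy as np

# same idea as above, but tolerates a row count that isn't a multiple of 5
dfs = [df.loc[idx] for idx in np.array_split(df.index, 5)]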

Related

Insert new data to dataframe

I have a dataframe:
employees = [('Jack', 34, 'Sydney'),
             ('Riti', 31, 'Delhi'),
             ('Aadi', 16, 'London'),
             ('Mark', 18, 'Delhi')]
dataFrame = pd.DataFrame(employees, columns=['Name', 'Age', 'City'])
I would like to add some new columns to this DataFrame. I did it with:
data = ['Height', 'Weight', 'Eyecolor']
duduFrame = pd.DataFrame(columns=data)
This results in:
Name Age City Height Weight Eyecolor
0 Jack 34.0 Sydney NaN NaN NaN
1 Riti 31.0 Delhi NaN NaN NaN
2 Aadi 16.0 London NaN NaN NaN
3 Mark 18.0 Delhi NaN NaN NaN
So far so good.
Now I have new data about Height, Weight and Eyecolor for "Riti":
Riti_data = [(172, 74, 'Brown')]
I would like to add this to dataFrame. I tried it with
dataFrame.loc['Riti', [duduFrame]] = Riti_data
But I get the error
ValueError: Buffer has wrong number of dimensions (expected 1, got 3)
What am I doing wrong?
Try this:
dataFrame.loc[dataFrame['Name']=='Riti', ['Height','Weight','Eyecolor']] = Riti_data
Your mistake, I think, was not specifying the target columns correctly: you passed duduFrame itself instead of data, the list holding the names of the columns you want to fill with the new values.
You can do this:
df = pd.concat([dataFrame, duduFrame])
df = df.set_index('Name')
df.loc['Riti',data] = [172,74,'Brown']
Resulting in:
Age City Height Weight Eyecolor
Name
Jack 34.0 Sydney NaN NaN NaN
Riti 31.0 Delhi 172 74 Brown
Aadi 16.0 London NaN NaN NaN
Mark 18.0 Delhi NaN NaN NaN
Pandas has a pd.concat function, whose role is to concatenate dataframes, either vertically (axis=0) or, in your case, horizontally (axis=1).
However, I personally see merging horizontally more as a pd.merge use-case, which gives you more flexibility over how exactly you want the merge to happen.
In your case, you want to match on the Name column, right?
So I would do it in 2 steps, as sketched below:
Build both dataframes with a Name column and their respective data.
Merge both dataframes with pd.merge(df1, df2, on='Name', how='outer').
The how='outer' parameter makes sure you don't lose any data from df1 or df2 in case some Name has data in only one of the two dataframes. This will make it easier to catch errors in your data, and it will make you think more in terms of SQL JOINs, which is a useful way of thinking :).
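A minimal sketch of those two steps, using made-up stand-ins for the question's data:
import pandas as pd

# step 1: both dataframes carry a Name column
df1 = pd.DataFrame({'Name': ['Jack', 'Riti', 'Aadi', 'Mark'],
                    'Age': [34, 31, 16, 18],
                    'City': ['Sydney', 'Delhi', 'London', 'Delhi']})
df2 = pd.DataFrame({'Name': ['Riti'],
                    'Height': [172], 'Weight': [74], 'Eyecolor': ['Brown']})

# step 2: the outer merge keeps rows that appear in only one side
merged = pd.merge(df1, df2, on='Name', how='outer')
print(merged)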

How can I use groupby to merge rows in Pandas?

I have a dataframe that looks like this:
ID  Name  Major1   Major2    Major3
12  Dave  English  NaN       NaN
12  Dave  NaN      Biology   NaN
12  Dave  NaN      NaN       History
13  Nate  Spanish  NaN       NaN
13  Nate  NaN      Business  NaN
I need to merge rows resulting in this:
ID  Name  Major1   Major2    Major3
12  Dave  English  Biology   History
13  Nate  Spanish  Business  NaN
I know this is possible with groupby but I haven't been able to get it to work correctly. Can anyone help?
If you are intent on using groupby, you could do something like this:
dataframe = dataframe.melt(['ID', 'Name']).dropna()
dataframe = dataframe.groupby(['ID', 'Name', 'variable'])['value'].sum().unstack('variable')
You may have to mess with the column names a bit, but this is what comes to me as a possible solution using groupby.
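For reference, here is a self-contained run of that approach on the sample data; the 'sum' works here because each (ID, Name, variable) group holds a single string:
import pandas as pd

df = pd.DataFrame({'ID': [12, 12, 12, 13, 13],
                   'Name': ['Dave', 'Dave', 'Dave', 'Nate', 'Nate'],
                   'Major1': ['English', None, None, 'Spanish', None],
                   'Major2': [None, 'Biology', None, None, 'Business'],
                   'Major3': [None, None, 'History', None, None]})

out = df.melt(['ID', 'Name']).dropna()
out = out.groupby(['ID', 'Name', 'variable'])['value'].sum().unstack('variable')
print(out.reset_index())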
Use melt and pivot (recent pandas requires keyword arguments for pivot):
>>> df.melt(['ID', 'Name']).dropna() \
      .pivot(index=['ID', 'Name'], columns='variable', values='value') \
      .reset_index().rename_axis(columns=None)
ID Name Major1 Major2 Major3
0 12 Dave English Biology History
1 13 Nate Spanish Business NaN

How to add a word to the end of each string in a specific column (pandas dataframe)

I want to add "NSW" to the end of each town name in a pandas dataframe. The dataframe currently looks like this:
0 Parkes NaN
1 Forbes NaN
2 Yanco NaN
3 Orange NaN
4 Narara NaN
5 Wyong NaN
I need every town to also have the word NSW added to it.
Try with
df['Name'] = df['Name'] + ' NSW'
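A quick check of what that produces, assuming the town column is called 'Name' as in the answer:
import pandas as pd

df = pd.DataFrame({'Name': ['Parkes', 'Forbes', 'Yanco']})
df['Name'] = df['Name'] + ' NSW'
print(df['Name'].tolist())  # ['Parkes NSW', 'Forbes NSW', 'Yanco NSW']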

Select rows from dataframe which has multi index based on other dataframe columns

I have a 'univer' dataframe which has state and region name as columns:
State RegionName
0 Alabama Auburn
1 Alabama Florence
2 Alabama Jacksonville
3 Alabama Livingston
and a 'statobot' dataframe which has state and region name as its index:
State RegionName 2008Q3 2009Q2 ratio
AK Anchor Point NaN NaN NaN
Anchorage 296166.666667 271933.333333 1.089115
Fairbanks 249966.666667 225833.333333 1.106863
Homer NaN NaN NaN
Juneau 305133.333333 282666.666667 1.079481
Kenai NaN NaN NaN
Ketchikan NaN NaN NaN
Kodiak NaN NaN NaN
Lakes 257433.333333 257200.000000 1.000907
North Pole 241833.333333 219366.666667 1.102416
Palmer 259466.666667 263800.000000 0.983573
Now I want to select rows in the 'statobot' dataframe based on the 'univer' dataframe; the state and region name must match exactly. I have tried
haveuni = statobot[(statobot.index.get_level_values(0).isin(univer['State'])) & (statobot.index.get_level_values(1).isin(univer['RegionName']))]
but the result has many more rows than I expected. Is there another way to get an exact match?
Your isin attempt over-matches because it tests State and RegionName independently rather than as pairs. The simplest fix is to merge the two dataframes, keeping only the intersection:
statobot = statobot.set_index(['State', 'RegionName'])
univer = univer.set_index(['State', 'RegionName'])
print(pd.merge(univer, statobot, left_index=True, right_index=True, how='inner'))
Here both dataframes share the same multi-index. If you don't have those indices, you can specify the columns to merge on using the left_on and right_on parameters instead of left_index and right_index.
Otherwise, there are other ways using df.loc and a cross-section (df.xs) on the multi-index.
Loop through the 'univer' dataframe and look up the matching ('State', 'RegionName') index in the 'statobot' dataframe:
for i, row in univer.iterrows():
    print(statobot.loc[row['State'], row['RegionName']])
If you want the whole row without dropping the index fields, then:
for i, row in univer.iterrows():
    print(statobot.xs((row['State'], row['RegionName']), level=(0, 1), axis=0, drop_level=False))
Another method is to use boolean slicing:
statobot.reset_index(inplace=True)
for i, row in univer.iterrows():
    print(statobot[(statobot['State'] == row['State']) & (statobot['RegionName'] == row['RegionName'])])
I hope pd.merge works for your case. Let me know how it goes.
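A small self-contained sketch of that inner merge, with made-up rows standing in for the real data:
import pandas as pd

univer = pd.DataFrame({'State': ['AK', 'AK'],
                       'RegionName': ['Anchorage', 'Homer']})
statobot = pd.DataFrame({'State': ['AK', 'AK', 'AK'],
                         'RegionName': ['Anchorage', 'Fairbanks', 'Homer'],
                         'ratio': [1.089115, 1.106863, None]})

# index both frames on the (State, RegionName) pair and keep the intersection
univer = univer.set_index(['State', 'RegionName'])
statobot = statobot.set_index(['State', 'RegionName'])
haveuni = pd.merge(univer, statobot, left_index=True, right_index=True, how='inner')
print(haveuni)  # only Anchorage and Homer survive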

Pandas: merge dataframes without creating new columns inside a for operation

I'm trying to enrich a dataframe with data collected from an API.
So I'm doing it like this:
for i in df.index:
    if pd.isnull(df.cnpj[i]) == True:
        pass
    else:
        k = get_financials_hnwi(df.cnpj[i])  # this is my API requesting function, working fine
        df = df.merge(k, on=["cnpj"], how="left")  # here is my problem <----
Since I'm running that merge inside a for loop, it keeps adding suffixed columns (_x, _y). So I found this alternative here:
Pandas: merge dataframes without creating new columns
for i in df.index:
    if pd.isnull(df.cnpj[i]) == True:
        pass
    else:
        k = get_financials_hnwi(df.cnpj[i])  # this is my requesting function, working fine
        val = np.intersect1d(df.cnpj, k.cnpj)
        df_temp = pd.concat([df, k], ignore_index=True)
        df = df_temp[df_temp.cnpj.isin(val)]
However, it creates a new df, killing the original index and breaking the line if pd.isnull(df.cnpj[i]) == True:.
Is there a nice way to run a merge/join/concat inside a for loop without creating new _x and _y columns? Or is there a way to combine the _x and _y columns afterwards, getting rid of the suffixes and condensing everything into a single column? I just want one column with all of it.
Sample data and reproducible code:
df=pd.DataFrame({'cnpj':[12,32,54,65],'co_name':['Johns Market','T Bone Gril','Superstore','XYZ Tech']})
#first API request:
k=pd.DataFrame({'cnpj':[12],'average_revenues':[687],'years':['2019,2018,2017']})
df=df.merge(k,on="cnpj", how='left')
#second API request:
k=pd.DataFrame({'cnpj':[32],'average_revenues':[456],'years':['2019,2017']})
df=df.merge(k,on="cnpj", how='left')
#third API request:
k=pd.DataFrame({'cnpj':[53],'average_revenues':[None],'years':[None]})
df=df.merge(k,on="cnpj", how='left')
#fourth API request:
k=pd.DataFrame({'cnpj':[65],'average_revenues':[4142],'years':['2019,2018,2015,2013,2012']})
df=df.merge(k,on="cnpj", how='left')
print(df)
Result:
cnpj co_name average_revenues_x years_x average_revenues_y \
0 12 Johns Market 687.0 2019,2018,2017 NaN
1 32 T Bone Gril NaN NaN 456.0
2 54 Superstore NaN NaN NaN
3 65 XYZ Tech NaN NaN NaN
years_y average_revenues_x years_x average_revenues_y \
0 NaN None None NaN
1 2019,2017 None None NaN
2 NaN None None NaN
3 NaN None None 4142.0
years_y
0 NaN
1 NaN
2 NaN
3 2019,2018,2015,2013,2012
Desired result:
cnpj co_name average_revenues years
0 12 Johns Market 687.0 2019,2018,2017
1 32 T Bone Gril 456.0 2019,2017
2 54 Superstore None None
3 65 XYZ Tech 4142.0 2019,2018,2015,2013,2012
As you're joining on a single column and mapping values, we can take advantage of the cnpj column by setting it as the index; then we can use combine_first, update, or map to add your values to your dataframe.
I'm assuming k will look like this; if not, just update the function to return a dictionary that you can use with map:
cnpj average_revenues years
0 12 687 2019,2018,2017
Let's wrap this in a tidy function:
def update_api_call(dataframe, api_call):
    # index on 'cnpj' once; skip if it is already the index
    if dataframe.index.name != 'cnpj':
        dataframe = dataframe.set_index('cnpj')
    return dataframe.combine_first(api_call.set_index('cnpj'))
Assuming your k variables are numbered 1-4 for our test:
df1 = update_api_call(df,k1)
print(df1)
average_revenues co_name years
cnpj
12 687.0 Johns Market 2019,2018,2017
32 NaN T Bone Gril NaN
54 NaN Superstore NaN
65 NaN XYZ Tech NaN
df2 = update_api_call(df1,k2)
print(df2)
average_revenues co_name years
cnpj
12 687.0 Johns Market 2019,2018,2017
32 456.0 T Bone Gril 2019,2017
54 NaN Superstore NaN
65 NaN XYZ Tech NaN
print(df4)
average_revenues co_name years
cnpj
12 687.0 Johns Market 2019,2018,2017
32 456.0 T Bone Gril 2019,2017
53 NaN NaN NaN
54 NaN Superstore NaN
65 4142.0 XYZ Tech 2019,2018,2015,2013,2012
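To wire this into the original loop, something like the following sketch should work; get_financials_hnwi is the question's API function, assumed to return a one-row dataframe keyed by cnpj:
# grab the non-null cnpj values up front, because update_api_call moves
# 'cnpj' into the index on the first pass
for cnpj in df['cnpj'].dropna().tolist():
    k = get_financials_hnwi(cnpj)  # one-row dataframe per API call (assumed)
    df = update_api_call(df, k)

df = df.reset_index()  # restore 'cnpj' as a regular column, if needed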
