Iterating through dataframes in a list using for loops - python

I am new to pandas/Python. I am reading an .xlsx file, and from it I created a bunch of dataframes, 16 to be precise, plus a master dataframe which is empty. Now I want to append all 16 dataframes to the master dataframe one by one, using a for loop.
One method I thought of is iterating through a list. But can these df_1, df_2, etc. be stored in a list so that we can iterate over them?
Let's say I had CSV files; then:
df1 = pd.read_csv('---.csv')
df2 = pd.read_csv('---.csv')
Then I create a list:
filenames = ['---.csv', '---.csv']
create an empty master list:
master_df = []
and finally loop through the list:
for f in filenames:
    master_df.append(pd.read_csv(f))
But this won't apply here; I need something similar. How can I iterate over all the dataframes? Any solution would be appreciated.
Finally, this is my master_df:
master_df = pd.DataFrame({'Variable_Name': [], 'Value': [], 'Count': []})
and this is the first dataframe:
df_1 = pd.DataFrame({
    'Variable_Name': ['Track', 'Track', 'Track', 'Track'],
    'Value': ['Track 38', 'Track 39', 'Track 40', 'Track 37'],
    'Count': [161, 160, 158, 152]})
There are 15 more like it.

This is because append() returns a new dataframe, and that object has to be stored somewhere.
Try:
for f in filenames:
    master_df = master_df.append(pd.read_csv(f))
More info on the append function: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.append.html
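Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on current versions the same idea is written with pd.concat. A minimal sketch, assuming the 16 dataframes from the question are named df_1 through df_16:
import pandas as pd

# Collect the existing dataframes in a list instead of juggling
# 16 separate variable names.
frames = [df_1, df_2]  # ... up to df_16
# Concatenate them in one call; this is also faster than appending
# inside a loop, which re-copies the accumulated data on every iteration.
master_df = pd.concat(frames, ignore_index=True)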

Related

Flatten a Dataframe that is pivoted

I have the following code that takes a single column and pivots it into multiple columns. There are blanks in my result that I am trying to remove, but I am running into issues with the wrong values being applied to rows.
task_df = task_df.pivot(index=pivot_cols, columns='Field')['Value'].reset_index()
task_df[['Color','Class']] = task_df[['Color','Class']].bfill()
task_df[['Color','Class']] = task_df[['Color','Class']].ffill()
task_df = task_df.drop_duplicates()
[Screenshots omitted: Start, Current, and Desired tables.]
This is basically merging all rows having the same name or id together. You can do it with this:
mergers = {'ID': 'first', 'Color': 'sum', 'Class': 'sum'}
task_df = (task_df.groupby('Name', as_index=False)
           .aggregate(mergers)
           .reindex(columns=task_df.columns)
           .sort_values(by=['ID']))
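For context, 'sum' on string (object) columns concatenates the values within each group, which is what collapses the partially blank rows. A tiny sketch with made-up data:
import pandas as pd

# Hypothetical frame mimicking the pivoted result: one partial row per field.
task_df = pd.DataFrame({'Name': ['a', 'a'], 'ID': [1, 1],
                        'Color': ['red', ''], 'Class': ['', 'x']})
mergers = {'ID': 'first', 'Color': 'sum', 'Class': 'sum'}
print(task_df.groupby('Name', as_index=False).aggregate(mergers))
#   Name  ID Color Class
# 0    a   1   red     x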

Python - Generating new larger dataset from existing dataset, looping row

I looked at both How do I generate a new set of values from existing dataset and generate data by using existing dataset as the base dataset; neither fulfills my needs, so I read a ton of looping answers, but that didn't get me all the way.
I have the traditional adult dataset. After cleaning it and saving some of it for validation, it looks like this:
Adult dataset - 43958 rows and 12 columns
I want to run a loop that takes each row and adds a new row where age is increased by 1, but keeps all other data equal to that of the row.
I have tried two different ways.
Nr 1:
df1 = newDataFrame
# iterate through each row of the dataframe
for index, row in df1.iterrows():
    new_row = {'age': index+1, 'workclass': [], 'education': [], 'educational-num': [], 'marital-status': [], 'occupation': [],
               'race': [], 'gender': [], 'capital-gain': [], 'capital-loss': [], 'hours-per-week': [], 'income': []}
    print(new_row)
But that gives me:
{'age': 35596, 'workclass': [], 'education': [], 'educational-num': [], 'marital-status': [], 'occupation': [], 'race': [], 'gender': [], 'capital-gain': [], 'capital-loss': [], 'hours-per-week': [], 'income': []}
I also tried:
df1 = newDataFrame
colums =list(df1)
#iterate through each row of dataframe
for index, row in df1.iterrows():
values = [([0]+1),[1],[2],[3],[4],[5],[6],[7],[8],[9],[10],[11]]
zipped =zip(colums, values)
a_dictionary = dict(zipped)
print(a_dictionary)
But I get the error:
> TypeError: can only concatenate list (not "int") to list
I understand that it is because of the colums = list(df1) line, but I don't know how to change it. I tried some append() calls, but that didn't help.
So after two days I turn to you.
The goal is to make the dataset bigger while keeping a strong correlation between values.
Perfect, thanks #gofvonx!
I had to make a small change, but this worked:
df1 = newDataFrame
df_new = df1.copy()
df_new.age += 1
pd.concat([df1, df_new], axis=0, ignore_index=True)
Your code above has some issues. E.g., new_row is overwritten at each iteration without storing the previous value.
But you do not need to use a loop. You can try
df_new = df1.copy()
df_new['age'] += 1
pd.concat([df1, df_new], axis=0, ignore_index=True)
Note that ignore_index=True will create a new index 0,...,n-1 (see the documentation here).
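If the aim is to grow the dataset further, the same pattern extends to several shifted copies without any row-wise loop. A sketch, assuming df1 is the cleaned adult dataframe from the question and that shifting age by more than one year is acceptable:
import pandas as pd

frames = [df1]
for shift in (1, 2):  # hypothetical: two extra copies, ages +1 and +2
    shifted = df1.copy()
    shifted['age'] += shift
    frames.append(shifted)
df_big = pd.concat(frames, ignore_index=True)  # three times the original rows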

How to loop through a list of Dataframes

I have about 13 dataframes. I need to write them all to csv, so I thought I would use a for loop.
For example:
data1 = pd.DataFrame({'Name': ['Tom', 'Jack', 'Steve', 'Ricky'], 'Age': [28, 34, 29, 42]})
data2 = pd.DataFrame({'Name': ['ABC', 'EFG', 'HIJ', 'LMN'], 'Age': [2, 3, 9, 4]})
..
data13 = ....
list_df = [data1, data2, ....., data13]
for i in list_df:
    list_df[i].to_csv(...)
But this raises an error, because a list can't be indexed with a dataframe. What can I do to loop through the dataframes?
for i in list_df:
    i.to_csv(...)
Here the variable i is the individual dataframe in the list; I think you are expecting it to be an index, which is not the case.
for i, x in enumerate(list_df):
    list_df[i].to_csv(...)
This would also work.
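For a complete loop, each dataframe needs its own output path; a small sketch with hypothetical filenames:
for i, df in enumerate(list_df):
    # writes data1.csv, data2.csv, ..., data13.csv (the names are an assumption)
    df.to_csv(f'data{i + 1}.csv', index=False)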
Just do this:
for i in list_df:
    i.to_csv(...)
Your i references the dataframe. You can do it one of the following ways:
for i in list_df:
    i.to_csv(...)
or:
for i, df in enumerate(list_df):
    list_df[i].to_csv(...)

Converting list of dfs from pd.read_html into dfs with pandas

Is there a way to modify pd.read_html such that it returns a dataframe instead of a list of dataframes?
Context:
I am trying to use pandas read_html to import tables from a website. I understand that pd.read_html returns a list of dfs instead of individual dataframes. I've been circumventing this by assigning the first (and only dataframe) in the list returned from pd.read_html to a new variable. However, I want to store multiple dataframes from different urls in a master dictionary (using the code below) and would like the values to be dataframe elements, not lists.
urls_dict = {
    '2017': 'https://postgrad.sgu.edu/ResidencyAppointmentDirectory.aspx?year=2017',
    '2016': 'https://postgrad.sgu.edu/ResidencyAppointmentDirectory.aspx?year=2016',
    '2015': 'https://postgrad.sgu.edu/ResidencyAppointmentDirectory.aspx?year=2015',
    '2014': 'https://postgrad.sgu.edu/ResidencyAppointmentDirectory.aspx?year=2014',
    '2013': 'https://postgrad.sgu.edu/ResidencyAppointmentDirectory.aspx?year=2013',
    '2012': 'https://postgrad.sgu.edu/ResidencyAppointmentDirectory.aspx?year=2012',
    '2011': 'https://postgrad.sgu.edu/ResidencyAppointmentDirectory.aspx?year=2011',
    '2010': 'https://postgrad.sgu.edu/ResidencyAppointmentDirectory.aspx?year=2010',
    '2009': 'https://postgrad.sgu.edu/ResidencyAppointmentDirectory.aspx?year=2009'
}
dfs_dict = {}
for key, url in urls_dict.items():
    dfs_dict[key] = pd.read_html(url)
Use a list comprehension inside of pd.concat to concatenate the dataframes for each year (use .assign(year=year) to add the respective years as a column).
Note that pd.read_html(url) returns a list of dataframes. For the given urls, the length of the list is never more than one, so use pd.read_html(url)[0] to access the actual dataframe, then assign the year as a column.
dfs = pd.concat([pd.read_html(url)[0].assign(year=year) for year, url in urls_dict.items()])
Note that you can create urls_dict using the following dictionary comprehension together with f-strings (formatted string literals, introduced in Python 3.6):
years = range(2009, 2018)
urls_dict = {
    str(year): f'https://postgrad.sgu.edu/ResidencyAppointmentDirectory.aspx?year={year}'
    for year in years
}
IIUC, we can make a slight edit to your code and call pd.concat to combine all the tables each pd.read_html call returns:
dfs = {}  # initialise the dict for the loop.
# access the keys and values of the dictionary:
# in {'2017': [1, 2, 3]}, '2017' is the key and [1, 2, 3] is the value.
for key, url in urls_dict.items():
    # for each item in your dict, read in the url and concat the resulting list of tables with pd.concat
    dfs[key] = pd.concat(pd.read_html(url))
    dfs[key]['grad_year'] = key  # if you want to assign the key to a column.
    dfs[key] = dfs[key].drop('PGY', axis=1)  # drop PGY.
print(dfs['2017'].iloc[:5, :3])
PGY Type Name
0 PGY-1 Categorical Van Denakker, Tayler
1 PGY-1 Preliminary Bisharat-Kernizan, Jumana
2 PGY-1 Preliminary Schiffenhaus, James
3 PGY-1 Categorical Collins, Kelsey
4 PGY-1 Categorical Saker, Erfanul
type(dfs['2017'])
pandas.core.frame.DataFrame

Pandas read_csv into multiple DataFrames

I have some data in text file that I am reading into Pandas. A simplified version of the txt read in is:
idx_level1|idx_level2|idx_level3|idx_level4|START_NODE|END_NODE|OtherData...
353386066294006|1142|2018-09-20T07:57:26Z|1|18260004567689|18260005575180|...
353386066294006|1142|2018-09-20T07:57:26Z|2|18260004567689|18260004240718|...
353386066294006|1142|2018-09-20T07:57:26Z|3|18260005359901|18260004567689|...
353386066294006|1142|2018-09-20T07:57:31Z|1|18260004567689|18260005575180|...
353386066294006|1142|2018-09-20T07:57:31Z|2|18260004567689|18260004240718|...
353386066294006|1142|2018-09-20T07:57:31Z|3|18260005359901|18260004567689|...
353386066294006|1142|2018-09-20T07:57:36Z|1|18260004567689|18260005575180|...
353386066294006|1142|2018-09-20T07:57:36Z|2|18260004567689|18260004240718|...
353386066294006|1142|2018-09-20T07:57:36Z|3|18260005359901|18260004567689|...
353386066736543|22|2018-04-17T07:08:23Z||||...
353386066736543|22|2018-04-17T07:08:24Z||||...
353386066736543|22|2018-04-17T07:08:25Z||||...
353386066736543|22|2018-04-17T07:08:26Z||||...
353386066736543|403|2018-07-02T16:55:07Z|1|18260004580350|18260005235340|...
353386066736543|403|2018-07-02T16:55:07Z|2|18260005235340|18260005141535|...
353386066736543|403|2018-07-02T16:55:07Z|3|18260005235340|18260005945439|...
353386066736543|403|2018-07-02T16:55:07Z|4|18260006215338|18260005235340|...
353386066736543|403|2018-07-02T16:55:07Z|5|18260004483352|18260005945439|...
353386066736543|403|2018-07-02T16:55:07Z|6|18260004283163|18260006215338|...
353386066736543|403|2018-07-02T16:55:01Z|1|18260004580350|18260005235340|...
353386066736543|403|2018-07-02T16:55:01Z|2|18260005235340|18260005141535|...
353386066736543|403|2018-07-02T16:55:01Z|3|18260005235340|18260005945439|...
353386066736543|403|2018-07-02T16:55:01Z|4|18260006215338|18260005235340|...
353386066736543|403|2018-07-02T16:55:01Z|5|18260004483352|18260005945439|...
353386066736543|403|2018-07-02T16:55:01Z|6|18260004283163|18260006215338|...
And the code I use to read in is as follows:
mydata = pd.read_csv('/myloc/my_simple_data.txt', sep='|',
                     dtype={'idx_level1': 'int',
                            'idx_level2': 'int',
                            'idx_level3': 'str',
                            'idx_level4': 'float',
                            'START_NODE': 'str',
                            'END_NODE': 'str',
                            'OtherData...': 'str'},
                     parse_dates=['idx_level3'],
                     index_col=['idx_level1', 'idx_level2', 'idx_level3', 'idx_level4'])
What I really want is a separate pandas DataFrame for each unique idx_level1 & idx_level2 combination. So in the above example there would be 3 DataFrames, pertaining to idx_level1|idx_level2 values of 353386066294006|1142, 353386066736543|22 and 353386066736543|403 respectively.
Is it possible to read in a text file like this and output each change in idx_level2 to a new Pandas DataFrame, maybe as part of some kind of loop? Alternatively, what would be the most efficient way to split mydata into DataFrame subsets, given that everything I have read suggests that it is inefficient to iterate through a DataFrame.
Read your dataframe as you are currently doing, then groupby and use a list comprehension:
group = mydata.groupby(level=[0,1])
dfs = [group.get_group(x) for x in group.groups]
You can then access the individual dataframes with dfs[0] and so on.
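If positional access is awkward, a dict keyed by the group label may be easier; a small sketch built on the same groupby (the key values below come from the sample data):
dfs_by_key = {key: frame for key, frame in mydata.groupby(level=[0, 1])}
sub = dfs_by_key[(353386066294006, 1142)]  # one id pair from the sample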
To specifically address your last paragraph, you could create a dict of dfs, based on the unique values in a column, using something like:
import copy
dfs = {}
col_values = df[column].unique()
for value in col_values:
    key = 'df' + str(value)
    dfs[key] = copy.deepcopy(df)
    dfs[key] = dfs[key][df[column] == value]
    dfs[key].reset_index(inplace=True, drop=True)
where column = 'idx_level2'
Read the table as-is and use groupby, for instance:
data = pd.read_table('/myloc/my_simple_data.txt', sep='|')
groups = dict()
for group, subdf in data.groupby(data.columns[:2].tolist()):
    groups[group] = subdf
Now you have all the sub-dataframes in a dictionary whose keys are tuples of the two indexers (e.g. (353386066294006, 1142)).
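To pull one group back out, assuming the sample values above:
subset = groups[(353386066294006, 1142)]
print(len(subset))  # 9 rows for that pair in the sample data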
