I'm trying to export dataframes that are created iteratively based on a column value. The idea is to use the column value both to dictate the output folder and to filter the dataframe.
To create the dataframes iteratively I'm using exec(); the example follows below. The goal is to run df.to_json('dfName/'+datetime.today().strftime('%d-%m-%Y')+'.json') in a loop, where dfName changes iteratively to a, b, c. I'm sorry if this is a duplicate; I haven't found anything of the sort so far.
from datetime import datetime
import pandas as pd
data1 = ['a', 'a', 'a','b','b','b','c','c','c']
data2 = [1,2,3,4,5,6,7,8,9]
data3 = [10,11,12,13,14,15,16,17,18]
data = {
    'Name': data1,
    'data2': data2,
    'data3': data3}
df = pd.DataFrame(data)
for test in df.Name.unique():
    exec(test + "=df[df['Name'] == test]")
You can do it without exec() or manual filtering by using groupby():
from datetime import datetime
import pandas as pd
data1 = ['a', 'a', 'a','b','b','b','c','c','c']
data2 = [1,2,3,4,5,6,7,8,9]
data3 = [10,11,12,13,14,15,16,17,18]
data = {
    'Name': data1,
    'data2': data2,
    'data3': data3}
df = pd.DataFrame(data)
for name, n_df in df.groupby('Name'):
    # do what you need... n_df.to_csv() etc...
    print(name)
    print(n_df)
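To get the per-value JSON export from the question without exec(), the group key can drive both the folder and the file contents. A minimal sketch, assuming the output folders may not exist yet (os.makedirs creates them):

import os
from datetime import datetime

for name, n_df in df.groupby('Name'):
    os.makedirs(name, exist_ok=True)  # create folder 'a', 'b' or 'c' if missing
    n_df.to_json(name + '/' + datetime.today().strftime('%d-%m-%Y') + '.json')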
I have been trying to normalize this JSON data for quite some time now, but I am getting stuck at a very basic step. I think the answer might be quite simple. I will take any help provided.
import json
import urllib.request
import pandas as pd
url = "https://www.recreation.gov/api/camps/availability/campground/232447/month?start_date=2021-05-01T00%3A00%3A00.000Z"
with urllib.request.urlopen(url) as response:  # renamed to avoid shadowing the url string
    data = json.loads(response.read().decode())
#data = json.dumps(data, indent=4)
df = pd.json_normalize(data=data['campsites'], record_path='availabilities', meta='campsites')
print(df)
My expected df result is as follows (the expected DataFrame output was shown as a screenshot in the original post).
One approach (not using pd.json_normalize, which struggles here because availabilities is a date-keyed dict rather than the list of records that record_path expects) is to iterate through a list of the unique campsites and convert the data for each campsite to a DataFrame. The list of campsite-specific DataFrames can then be concatenated using pd.concat.
Specifically:
## generate a list of unique campsites
unique_campsites = [item for item in data['campsites'].keys()]

## function that returns a DataFrame for each campsite,
## renaming the index to 'date'
def campsite_to_df(data, campsite):
    out_df = pd.DataFrame(data['campsites'][campsite]).reset_index()
    out_df = out_df.rename({'index': 'date'}, axis=1)
    return out_df

## generate a list of DataFrames, one per campsite
df_list = [campsite_to_df(data, cs) for cs in unique_campsites]

## concatenate the list of DataFrames into a single DataFrame,
## convert campsite id to integer and sort by campsite + date
df_full = pd.concat(df_list)
df_full['campsite_id'] = df_full['campsite_id'].astype(int)
df_full = df_full.sort_values(by=['campsite_id', 'date'], ascending=True)

## remove extraneous columns and rename campsite_id to campsites
df_full = df_full[['campsite_id', 'date', 'availabilities',
                   'max_num_people', 'min_num_people', 'type_of_use']]
df_full = df_full.rename({'campsite_id': 'campsites'}, axis=1)
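As a side note, a flat list comprehension can build a similar long table directly. This is a sketch under the assumption (implied by the reset_index/rename step above) that availabilities maps date strings to status strings; the 'availability' column name is my own choice:

rows = [{'campsites': cs_id, 'date': d, 'availability': status}
        for cs_id, cs in data['campsites'].items()
        for d, status in cs['availabilities'].items()]
df_alt = pd.DataFrame(rows)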
I am working on my Covid data set from GitHub and I would like to filter my data set to the countries that appear in the following EU_members list, then save the result in CSV format.
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/owid-covid-data.csv')
df = df[df.continent == 'Europe']
# From here I want to just pick those countries that appear in the following list:
EU_members = ['Austria','Italy','Belgium','Latvia','Bulgaria','Lithuania','Croatia','Luxembourg','Cyprus','Malta','Czechia','Netherlands','Denmark','Poland','Estonia',
              'Portugal','Finland','Romania','France','Slovakia','Germany','Slovenia','Greece','Spain','Hungary','Sweden','Ireland']
# I have tried something like this but it is not what I expected:
df.location.str.find('EU_members')
You can use .isin():
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/owid-covid-data.csv')
EU_members = ['Austria','Italy','Belgium','Latvia','Bulgaria','Lithuania','Croatia','Luxembourg','Cyprus','Malta','Czechia','Netherlands','Denmark','Poland','Estonia',
              'Portugal','Finland','Romania','France','Slovakia','Germany','Slovenia','Greece','Spain','Hungary','Sweden','Ireland']
df_out = df[df['location'].isin(EU_members)]
df_out.to_csv('data.csv')
Creates data.csv (the output was shown as a screenshot in the original answer).
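A quick sanity check worth adding here: confirm that every list entry actually matches a location value. This would also have caught the missing comma between 'Belgium' and 'Latvia' in the original list (fixed above), which silently concatenated them into 'BelgiumLatvia':

missing = set(EU_members) - set(df['location'].unique())
print(missing)  # an empty set means every EU member was matched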
Here is my logic for finding the difference; I have to merge the results in the format of the screenshot. I am reading two different files, each with unique ids, where ids in the first file may match ids in the second file. I need to prepare a third file containing each id with its value from the first file and the second file, and also display the difference, if any. If a value is new, the respective cell should display either 0 or null.
import pandas as pd
import numpy as np
import re
import xlrd

df1_1 = pd.read_table('./first_file.txt', sep='\t')
df1_1.to_excel('filename1.xlsx')
df_first_file = pd.concat([df1_1['Column'].str.split(' ', expand=True)], axis=1)
df_new1 = df_first_file.to_excel('first_file1.xlsx')

df1 = pd.read_table('./otherfile.txt', sep='\t')
df1.to_excel('filename2.xlsx')
df_otherfile = pd.concat([df1['Column'].str.split(' ', expand=True)], axis=1)
df_new2 = df_otherfile.to_excel('other_file1.xlsx')

df1 = pd.read_excel('./first_file1.xlsx')
df2 = pd.read_excel('./other_file1.xlsx')
print(df1)

df3 = pd.DataFrame()
df3 = df2
mydict = dict()
mydict2 = dict()

for i in range(1, df2.shape[0]):
    for j in range(1, df1.shape[0]):
        if df2[0][i] == df1[0][j]:
            mydict[df2[0][i]] = df2[2][j]

for i in range(df1.shape[0]):
    if df1[0][i] in mydict:
        if str(df1[2][i]).find(',') != -1:
            a = df1[2][i].replace(',', '')
        if str(mydict[df1[0][i]]).find(',') != -1:
            b = mydict[df1[0][i]].replace(',', '')
        if float(a) > float(b):
            mydict2[df1[0][i]] = df1[2][i]

arr = []
arr.append("NULL")
for i in range(1, df3.shape[0]):
    if df3[0][i] in mydict2:
        if str(df3[2][i]).find(',') != -1:
            a = df3[2][i].replace(',', '')
        else:
            a = df3[2][i]
        if str(mydict2[df3[0][i]]).find(',') != -1:
            b = mydict2[df3[0][i]].replace(',', '')
        else:
            b = mydict2[df3[0][i]]
        arr.append(max(float(a), float(b)) - min(float(a), float(b)))
    else:
        arr.append(0.0)

df3['CHANGE'] = arr
writer = pd.ExcelWriter('./final.xlsx', engine='xlsxwriter')
df3.to_excel(writer, sheet_name='Sheet1')
writer.save()
I am still figuring out whether using the merge function will work here.
It's working for me:
print (df1)
print(df2)
merge_results = pd.merge(df1,df2,on='IDNUMBER', how='outer', indicator=True)
print(merge_results)
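To also produce the CHANGE column the question describes, a hedged sketch on top of that merge; VALUE_x and VALUE_y are hypothetical suffixed column names produced by the merge, not names from the original files:

# strip thousands separators, then take the absolute difference; unmatched rows get 0
merge_results['CHANGE'] = (
    pd.to_numeric(merge_results['VALUE_x'].astype(str).str.replace(',', ''), errors='coerce')
    - pd.to_numeric(merge_results['VALUE_y'].astype(str).str.replace(',', ''), errors='coerce')
).abs().fillna(0)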
I have 15 samples that each have a unique value of a parameter L.
Each sample was tested and provided data which I have placed into separate DataFrames in Pandas.
Each of the DataFrames has a different number of rows, and I want to place the corresponding value of L in each row, i.e. create a column for parameter L.
Note that L is constant in its respective DataFrame.
Is there a way to write a loop that will take a value of L from a list containing all of its values, and create a column in its corresponding sample data DataFrame?
I have so far been copying and pasting each line, and then updating the values and DataFrame names manually, but I suppose that this is not the most effective way of using python/pandas!
Most of the code I have used so far has been based on what I have found online, and my actual understanding of it is quite limited but I have tried to comment where possible.
UPDATED based on first suggested answer.
import pandas as pd
from pandas import DataFrame
import numpy as np
from pathlib import Path
from glob import glob
from os.path import join
path = r'file-directory/'
data_files = glob(join(path, '*.txt'))

def main():
    from contextlib import ExitStack
    with ExitStack() as context_manager:  # Allows python to access the different data files
        files = [context_manager.enter_context(open(f, "r")) for f in data_files]
        # Define an empty list and start reading data files
        df1 = []
        for file in files:
            df = pd.read_csv(file,
                             encoding='utf-8',
                             skiprows=114,
                             header=0,
                             # names=heads,
                             skipinitialspace=True,
                             sep='\t')
            # Process the dataframe to remove unwanted rows and columns, and rename the headers
            df = df[df.columns[[1, 2, 4, 6, 8, 10, 28]]]
            df = df.drop(0, axis=0)
            df = df.reset_index(drop=True)
            # 'heads' is a list of column names defined elsewhere in the script
            df.rename(columns=dict(zip(df, heads)), inplace=True)
            for columns in df:
                df[columns] = pd.to_numeric(df[columns], errors='coerce')
            # Append each new dataframe to the list
            df1.append(df)

    # Extract dataframes from list
    data1_0 = df1[0]
    data1_1 = df1[1]
    data1_2 = df1[2]
    data1_3 = df1[3]
    data1_4 = df1[4]
    data1_5 = df1[5]
    data1_6 = df1[6]
    data1_7 = df1[7]
    data1_8 = df1[8]
    data1_9 = df1[9]
    data1_10 = df1[10]
    data1_11 = df1[11]
    data1_12 = df1[12]
    data1_13 = df1[13]
    data1_14 = df1[14]

    # Add in a new column for values of 'L'
    L = ['L0', 'L1', 'L2', 'L3', 'L4', 'L5', 'L6', 'L7', 'L8', 'L9', 'L10', 'L11', 'L12', 'L13', 'L14']
    data1_0['L'] = L[0]
    data1_1['L'] = L[1]
    data1_2['L'] = L[2]
    data1_3['L'] = L[3]
    data1_4['L'] = L[4]
    data1_5['L'] = L[5]
    data1_6['L'] = L[6]
    data1_7['L'] = L[7]
    data1_8['L'] = L[8]
    data1_9['L'] = L[9]
    data1_10['L'] = L[10]
    data1_11['L'] = L[11]
    data1_12['L'] = L[12]
    data1_13['L'] = L[13]
    data1_14['L'] = L[14]
    return 0

if __name__ == "__main__":
    import sys
    sys.exit(main())
The method I am using (copying and pasting lines) works so far; it just doesn't seem to be the most efficient use of my time or the tools I have, and with my limited Python experience I don't really know how to approach this one.
I also have several other parameters and datasets that I need to do this for, so any help would be greatly appreciated!
You can do just data1_0['L'] = L[0] and so on for the rest of the DataFrames. Assigning a single scalar value fills the whole column with that value automatically, so there is no need to compute the length or an index.
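A two-line demonstration of that broadcast behaviour:

demo = pd.DataFrame({'x': [1, 2, 3]})
demo['L'] = 'L0'  # the scalar is broadcast to every row of the new column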
Untested code:
import pandas as pd
from pandas import DataFrame
import numpy as np
import os
from pathlib import Path
from glob import glob
from os.path import join

path = r'file-directory/'
data_files = glob(join(path, '*.txt'))

def main():
    from contextlib import ExitStack
    with ExitStack() as context_manager:  # Allows python to access the different data files
        files = [context_manager.enter_context(open(f, "r")) for f in data_files]
        # Define an empty list and start reading data files
        df1 = []
        for file in files:
            df = pd.read_csv(file,
                             encoding='utf-8',
                             skiprows=114,
                             header=0,
                             # names=heads,
                             skipinitialspace=True,
                             sep='\t')
            # Process the dataframe to remove unwanted rows and columns, and rename the headers
            df = df[df.columns[[1, 2, 4, 6, 8, 10, 28]]]
            df = df.drop(0, axis=0)
            df = df.reset_index(drop=True)
            # 'heads' is a list of column names defined elsewhere in the script
            df.rename(columns=dict(zip(df, heads)), inplace=True)
            for columns in df:
                df[columns] = pd.to_numeric(df[columns], errors='coerce')
            # Add file name as identifier
            df['FNAME'] = os.path.basename(file.name)
            # Append each new dataframe to the list
            df1.append(df)

        # Concatenate the results into a single dataframe
        data = pd.concat(df1)
        L = ['L0', 'L1', 'L2', 'L3', 'L4', 'L5', 'L6', 'L7', 'L8', 'L9', 'L10', 'L11', 'L12', 'L13', 'L14']
        # Supposing number of files and length of L is the same
        repl_dict = {k: v for k, v in zip([os.path.basename(file.name) for file in files], L)}
        # Add the new column
        data['L'] = data.FNAME.map(repl_dict)
    return 0

if __name__ == "__main__":
    import sys
    sys.exit(main())
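One caveat worth adding: glob() does not guarantee a stable file order, so if the pairing between files and L values matters, sort the paths:

data_files = sorted(glob(join(path, '*.txt')))  # deterministic order so files pair up with L predictably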
In my environment I have a list of several pandas data frames that are similarly named.
For example:
import pandas as pd
import numpy as np
dates = pd.date_range('2021-01-01', periods=6)  # hypothetical index; 'dates' was undefined in the original
df_abc = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
df_xyz = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
df_2017 = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
# ... potentially others
I'd like to create a list that automatically figures out which data frames are in my environment and pulls them into a list dynamically.
list_of_dfs = ['df_abc','df_xyz','df_2017', 'df_anything else']
# except done dynamically. In this example anything beginning with 'df_'
# list_of_dfs = function_help(begins with 'df_')
globals() should return a dictionary of variable_name:variable_value for the global variables.
If you want a list of defined variables with names starting with 'df_' you could do:
list_of_dfs = [variable for variable in globals().keys()
if variable.startswith('df_')]
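To avoid also catching non-DataFrame variables whose names happen to start with 'df_', a type check can be added; a small sketch:

import pandas as pd

list_of_dfs = [name for name, value in globals().items()
               if name.startswith('df_') and isinstance(value, pd.DataFrame)]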
I reckon there has to be a better way than storing your dataframes globally and relying on globals() to fetch their variable names, though. Maybe store them all inside a dictionary:
dataframes = {}
dataframes['df_1'] = pd.DataFrame()
dataframes['df_2'] = pd.DataFrame()
list_of_dfs = list(dataframes.keys())
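With that layout, iterating over every frame is straightforward:

for name, frame in dataframes.items():
    print(name, frame.shape)  # each frame stays addressable by its key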