The end goal is to read multiple .csv files into multiple DataFrames with specific names.
I want to be able to refer to each DataFrame by the name of its city for further analysis and manipulate them separately, so it is important to achieve that rather than keep them in a dictionary. But what ends up happening is that the last item in the dict gets assigned to every variable: differently named dfs are created, but they all hold the same data.
lst0 = ['/User1/Research/comp_dataset/yutas_tg/Annapolis__MD_ALL.csv',
        '/User1/Research/comp_dataset/yutas_tg/Apalachicola__FL_ALL.csv',
        '/User1/Research/comp_dataset/yutas_tg/Atlantic_City__NJ_ALL.csv']
names_3 = ['annapolis', 'apalachicola', 'atlantic_city']
d = {}
for fname in lst0:
    d[fname] = pd.read_csv(fname)
for nm in names_3:
    for fname in lst0:
        globals()[nm] = d[fname]
What am I doing wrong?
Thank you!
Your variable naming makes no sense to me; please name variables after the values they hold.
As to your problem: the inner loop assigns d[fname] for every file to the same name nm, so each name ends up bound to the DataFrame of the last file. You don't need to create a variable per city at all:
paths = [
    "/User1/Research/comp_dataset/yutas_tg/Annapolis__MD_ALL.csv",
    "/User1/Research/comp_dataset/yutas_tg/Apalachicola__FL_ALL.csv",
    "/User1/Research/comp_dataset/yutas_tg/Atlantic_City__NJ_ALL.csv",
]
cities = ["annapolis", "apalachicola", "atlantic_city"]

# Create one dataframe per CSV file
d = {city: pd.read_csv(path) for path, city in zip(paths, cities)}

# Join the frames together, adding the new `city` column
df = (
    pd.concat(d.values(), keys=d.keys(), names=["city", None])
    .reset_index(level=0)
    .reset_index(drop=True)
)
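With this layout you don't need a separate variable per city; one city's rows can be pulled from either the dict or the combined frame. A minimal sketch, reusing the names defined above:
# Either grab a single city's frame from the dict...
annapolis = d["annapolis"]
# ...or filter the combined frame on its `city` column
annapolis = df[df["city"] == "annapolis"]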
Ok, I figured it out.
Combining what Code Different suggested below, but skipping the concatenation part, I finally get the variables (which are DataFrames) created with the names of the cities.
paths = [
    "/User1/Research/comp_dataset/yutas_tg/Annapolis__MD_ALL.csv",
    "/User1/Research/comp_dataset/yutas_tg/Apalachicola__FL_ALL.csv",
    "/User1/Research/comp_dataset/yutas_tg/Atlantic_City__NJ_ALL.csv",
]
cities = ["annapolis", "apalachicola", "atlantic_city"]

# Create one dataframe per CSV file
d = {city: pd.read_csv(path) for path, city in zip(paths, cities)}

for k in d.keys():
    exec(f"{k} = d['{k}']")
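For what it's worth, the exec call isn't strictly needed; a sketch of the same effect, with the usual caveats about creating variables dynamically:
# Inject every city DataFrame into the global namespace in one call
globals().update(d)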
# Fetch the data in a sequence of 1 million rows as dataframes
df1 = My_functions.get_ais_data(json1)
df2 = My_functions.get_ais_data(json2)
df3 = My_functions.get_ais_data(json3)
df_all = pd.concat([df1, df2, df3], axis=0)
# Save the data frame under a name built from oldest_id and the corresponding ISO date
df_all.to_csv('oldest_id + iso_date +.csv')
The last line might be silly, but I am trying to save the data frame under a name built from some variables I created earlier in the code.
You can use an f-string to embed variables in strings like this:
df_all.to_csv(f'/path/to/folder/{oldest_id}{iso_date}.csv')
If you need the value corresponding to the variable, then mid's answer is correct, thus:
df_all.to_csv(f'/path/to/folder/{oldest_id}{iso_date}.csv')
However, if you want to use the name of the variable itself (the f-string `=` specifier below requires Python 3.8+):
df_all.to_csv('/path/to/folder/' + f'{oldest_id=}'.split('=')[0] + f'{iso_date=}'.split('=')[0] + '.csv')
would do the work.
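A quick demonstration of what the `=` specifier produces (the value is hypothetical):
oldest_id = 'abc123'  # hypothetical value for illustration
print(f'{oldest_id=}')                # prints: oldest_id='abc123'
print(f'{oldest_id=}'.split('=')[0])  # prints: oldest_id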
Maybe try:
file_name = f"{oldest_id}{iso_date}.csv"
df_all.to_csv(file_name)
Assuming you are using Python 3.6 and up.
I am trying to match names from two dataframes (on their name columns) using fuzzywuzzy with process. The result should be df1 (dfdum) gaining the best-matching name from df2 (dfpep) and the similarity score. That part works well with the code below, but besides the matching name and score I want to append more columns from df2 to df1 in the result: the date of birth and country of residence from df2 belonging to the matching name should also be added to df1. I cannot simply merge on names because there are duplicates.
Can anyone help me amend the code so that I can add the extra info from the matching names from df2? I thus want to add two extra columns to df1 with the related information from the matching name in df2.
pep_name = []
sim_name = []
for i in dfdum.NAME:
    ratio = process.extract(i, dfpep.NAME, limit=1, scorer=fuzz.token_set_ratio)
    pep_name.append(ratio[0][0])
    sim_name.append(ratio[0][1])
dfdum['pep_name'] = pd.Series(pep_name)
dfdum['sim_name'] = pd.Series(sim_name)
You could find the index of the best match in dfpep.NAME and use that to retrieve the corresponding values of the other two columns.
This code (with some mock data) should give you the desired result; it assumes that dfpep.NAME has only unique values, though.
Please note that I'm far from a pandas expert, so this solution is by no means the fastest or most elegant, but it should do the job :)
Also, I feel like there should be a way to do this without the for loop; maybe someone here has an idea for that.
import pandas as pd
from fuzzywuzzy import process, fuzz

dfdum = pd.DataFrame(["Johnny", "Peter", "Ben"])
dfdum.columns = ["NAME"]

dfpep = pd.DataFrame(["Pete", "John", "Bennie"])
dfpep.columns = ["NAME"]
dfpep["dob"] = pd.Series(["1990", "1991", "1992"])
dfpep["cor"] = pd.Series(["USA", "UK", "Germany"])

pep_name = []
sim_name = []
dob = []
cor = []
for i in dfdum.NAME:
    ratio = process.extract(i, dfpep.NAME, limit=1, scorer=fuzz.token_set_ratio)
    pep_name.append(ratio[0][0])
    sim_name.append(ratio[0][1])
    # Find the row index of the best match, then pull the extra columns from it
    j = dfpep.index[dfpep.NAME == ratio[0][0]].tolist()[0]
    dob.append(dfpep['dob'][j])
    cor.append(dfpep['cor'][j])

dfdum['pep_name'] = pd.Series(pep_name)
dfdum['sim_name'] = pd.Series(sim_name)
dfdum['dob'] = pd.Series(dob)
dfdum['cor'] = pd.Series(cor)
print(dfdum)
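On avoiding the loop-internal lookup: when choices is a pandas Series, fuzzywuzzy treats it as dict-like and, as far as I can tell (treat this as an assumption to verify), returns the Series index as a third tuple element. That allows a more compact sketch with extractOne and apply, starting from the mock dfdum/dfpep above:
def best_match(name):
    # Assumes extractOne on a Series returns (match, score, index)
    match, score, idx = process.extractOne(name, dfpep.NAME, scorer=fuzz.token_set_ratio)
    row = dfpep.loc[idx]
    return pd.Series({"pep_name": match, "sim_name": score,
                      "dob": row["dob"], "cor": row["cor"]})

dfdum = dfdum.join(dfdum["NAME"].apply(best_match))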
I'm trying to concatenate several different datasets with pandas in Python. I can concatenate them, but it results in several columns with the same name. How do I produce just one column per name instead of multiples?
concatenated_dataframes = pd.concat(
    [
        dice.reset_index(drop=True),
        json.reset_index(drop=True),
        flexjobs.reset_index(drop=True),
        indeed.reset_index(drop=True),
        simply.reset_index(drop=True),
    ],
    axis=1,
    ignore_index=True,
)

concatenated_dataframes_columns = [
    list(dice.columns),
    list(json.columns),
    list(flexjobs.columns),
    list(indeed.columns),
    list(simply.columns),
]

flatten = lambda nested_lists: [item for sublist in nested_lists for item in sublist]
concatenated_dataframes.columns = flatten(concatenated_dataframes_columns)
df = concatenated_dataframes
This results in:
UNNAMED: 0 TITLE COMPANY DESCRIPTION LOCATION TITLE JOBLOCATION POSTEDDATE DETAILSPAGEURL COMPANYPAGEURL COMPANYLOGOURL SALARY CLIENTBRANDID COMPANYNAME EMPLOYMENTTYPE SUMMARY SCORE EASYAPPLY EMPLOYERTYPE WORKFROMHOMEAVAILABILITY ISREMOTE UNNAMED: 0 TITLE SALARY JOBTYPE LOCATION DESCRIPTION UNNAMED: 0 TITLE SALARY JOBTYPE DESCRIPTION LOCATION UNNAMED: 0 COMPANY DESCRIPTION LOCATION SALARY TITLE
Again, how do I combine all the TITLE values into one column, all the LOCATION values into one column, and so on, instead of having multiples of them?
I think we can get away with making a blank dataframe that just has the columns we will want at the end and then concat() everything onto it.
import numpy as np
import pandas as pd

all_columns = list(dice.columns) + list(json.columns) + list(flexjobs.columns) + list(indeed.columns) + list(simply.columns)
# Keep each column name only once; you could print(all_unique_columns)
# to make sure it has what you want.
all_unique_columns = np.unique(np.array(all_columns))

df = pd.DataFrame(columns=all_unique_columns)
# Include the blank frame in the row-wise concat so its columns determine the column order
df = pd.concat([df, dice, json, flexjobs, indeed, simply], axis=0)
It's a little tricky not having reproducible examples of the dataframes that you have. I tested this on a small mock-up example I put together, but let me know if it works for your more complex example.
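A minimal sketch of why this works (the two frames here are made up): pd.concat along axis=0 aligns on column names, so columns sharing a name collapse into one and the gaps become NaN.
a = pd.DataFrame({"TITLE": ["x"], "SALARY": [1]})
b = pd.DataFrame({"TITLE": ["y"], "LOCATION": ["NY"]})
print(pd.concat([a, b], axis=0, ignore_index=True))
#   TITLE  SALARY LOCATION
# 0     x     1.0      NaN
# 1     y     NaN       NY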
I have attached the data here:
Excel Data
I need to return a DataFrame containing a list of all employees (EmployeeID, first name, middle name, last name) and their manager's first and last name. The columns in the output DataFrame should be: EmployeeID, FirstName, MiddleName, LastName, ManagerFirstName, ManagerLastName.
Hint: Consider joining the table with itself, as managers are employees themselves.
This is the code I have so far, which is giving me duplicate records:
# Creating data frame from Excel File. Enter the appropriate file path
df = pd.read_excel(Employees)
df_new = df[['EmployeeID', 'ManagerID', 'FirstName', 'MiddleName', 'LastName']].copy()
df_new['ManagerID'] = pd.to_numeric(df_new['ManagerID'], errors='coerce').fillna(0)
# convert object to int64
df_new['ManagerID'] = df_new['ManagerID'].astype(np.int64)
result = df_new.merge(df_new, left_on='EmployeeID', right_on='ManagerID')
print(result.head())
Any help on this would be greatly appreciated.
I think this will work
df = pd.DataFrame({"EmployeeID": [259, 278, 204, 78, 255],
                   "ManagerID": [278, 204, 78, 255, 259],
                   "FirstName": ["ben", "garret", "gabe", "reuben", "gordon"],
                   "MiddleName": ["T", "R", "B", "H", "L"],
                   "LastName": ["miller", "vargas", "mares", "dsa", "hee"]})
df['ManagerID'] = pd.to_numeric(df['ManagerID'], errors='coerce').fillna(0)

# Build a manager lookup keyed on ManagerID with renamed name columns
df_ = df[["EmployeeID", "FirstName", "LastName"]]
df_ = df_.rename(columns={"EmployeeID": "ManagerID", "FirstName": "ManagerFirstName", "LastName": "ManagerLastName"})

# Left-join each employee to their manager, then drop the key column
out = pd.merge(df, df_, on=["ManagerID"], how="left")
out = out.drop(["ManagerID"], axis=1)
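For reference, with the mock data above this should print something like:
print(out)
#    EmployeeID FirstName MiddleName LastName ManagerFirstName ManagerLastName
# 0         259       ben          T   miller           garret          vargas
# 1         278    garret          R   vargas             gabe           mares
# 2         204      gabe          B    mares           reuben             dsa
# 3          78    reuben          H      dsa           gordon             hee
# 4         255    gordon          L      hee              ben          miller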
I'm trying to run a script (API to Google Search Console) over a table of keywords and dates in order to check whether there was an improvement in keyword performance (SEO) after each date.
Since I'm really clueless, I'm guessing and trying, but Jupyter Notebook isn't responding, so I can't even tell if I'm wrong...
This repo was made by Josh Carty; the repo from which I took this code is:
https://github.com/joshcarty/google-searchconsole
I already pd.read_csv the input table (consisting of two columns, 'keyword' and 'date') and made the columns into two separate lists (or maybe it's better to use a dictionary/something else?):
KW_list and
Date_list
I tried:
for i in KW_list and j in Date_list:
    account = searchconsole.authenticate(client_config='client_secrets.json',
                                         credentials='credentials.json')
    webproperty = account['https://www.example.com/']
    report = webproperty.query.range(j, days=-30).filter('query', i, 'contains').get()
    report2 = webproperty.query.range(j, days=30).filter('query', i, 'contains').get()
    df = pd.DataFrame(report)
    df2 = pd.DataFrame(report2)
df
I expect to see a data frame of all the different keywords (keyword1 - stats1, keyword2 - stats2 below it, etc. [no overwriting]) for the dates 30 days before the date in the neighboring cell (in the input file),
or at least some response from Jupyter Notebook so I know what is going on.
Try using the zip function to combine the lists into a list of tuples. This way, each date is paired with its corresponding keyword.
account = searchconsole.authenticate(client_config='client_secrets.json', credentials='credentials.json')
webproperty = account['https://www.example.com/']

before_frames = []
after_frames = []
for keyword, date in zip(KW_list, Date_list):
    report = webproperty.query.range(date, days=-30).filter('query', keyword, 'contains').get()
    report2 = webproperty.query.range(date, days=30).filter('query', keyword, 'contains').get()
    before_frames.append(pd.DataFrame(report))
    after_frames.append(pd.DataFrame(report2))

# Stack the per-keyword frames instead of overwriting them on each pass
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
df1 = pd.concat(before_frames, ignore_index=True)
df2 = pd.concat(after_frames, ignore_index=True)
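A quick illustration of what zip produces here (sample values are hypothetical):
KW_list = ["shoes", "boots"]
Date_list = ["2019-01-01", "2019-02-01"]
print(list(zip(KW_list, Date_list)))
# [('shoes', '2019-01-01'), ('boots', '2019-02-01')]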