How do I save a dataframe under a name built from variables I created earlier in the code (oldest_id and iso_date as seen in the code)?


#fetch the data in a sequence of 1 million rows as dataframe
df1 = My_functions.get_ais_data(json1)
df2 = My_functions.get_ais_data(json2)
df3 = My_functions.get_ais_data(json3)
df_all = pd.concat([df1,df2,df3], axis = 0 )
#save the data frame with names of the oldest_id and the corresponding iso data format
df_all.to_csv('oldest_id + iso_date +.csv')
The last line might be silly, but I am trying to save the data frame under a file name built from variables I created earlier in the code.

You can use an f-string to embed variables in strings like this:
df_all.to_csv(f'/path/to/folder/{oldest_id}{iso_date}.csv')

If you need the value stored in the variable, then mid's answer is correct:
df_all.to_csv(f'/path/to/folder/{oldest_id}{iso_date}.csv')
However, if you want to use the name of the variable itself:
df_all.to_csv('/path/to/folder/' + f'{oldest_id=}'.split('=')[0] + f'{iso_date=}'.split('=')[0] + '.csv')
would do the job. Note that the = specifier in f-strings requires Python 3.8 or later.

Maybe try:
file_name = f"{oldest_id}{iso_date}.csv"
df_all.to_csv(file_name)
This assumes you are using Python 3.6 or later.
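If the date variable holds a full ISO timestamp, the file name may need a little cleaning first. The following is a minimal sketch, not part of the answers above: the helper name build_csv_path, the "exports" folder, the underscore separator, and the sample values are all assumptions made for illustration.

```python
from pathlib import Path


def build_csv_path(folder, oldest_id, iso_date):
    """Return folder/<oldest_id>_<iso_date>.csv with ':' replaced in the date."""
    # ':' appears in ISO timestamps such as 2021-03-04T10:15:00 but is not
    # allowed in Windows file names, so swap it for '-'.
    safe_date = str(iso_date).replace(":", "-")
    return Path(folder) / f"{oldest_id}_{safe_date}.csv"


# Hypothetical values standing in for the variables from the question.
oldest_id = 12345
iso_date = "2021-03-04T10:15:00"

csv_path = build_csv_path("exports", oldest_id, iso_date)
print(csv_path.name)  # 12345_2021-03-04T10-15-00.csv
# df_all.to_csv(csv_path, index=False)  # df_all is the DataFrame from the question
```

The resulting Path object can be passed straight to to_csv, as the commented-out last line suggests.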

Related

How do I add two dates that are saved in .json files?

I am having a hard time summing two dates that are saved in two separate JSON files. I want to add the two set dates together; they are stored in separate files.
The first file (A1.json) contains: {"expires": "2019-09-11"}
The second file (Whitelist.json) contains: {"expires": "0000-01-00"}
These dates are created by using tkcalendar and are later exported to these separate files, the idea being that summing them lets me set a date one month into the future. However, I can't seem to add them together without some form of error.
I have tried converting the JSON files to strings in Python and then adding them, and also using strptime to sum the dates.
Here is the relevant chunk of the code:
with open('A1.json') as f:
    data = json.loads(f.read())
    for material in data.items():
        A1 = (format(material[1]['expires']))
with open('Whitelist.json') as f:
    data = json.loads(f.read())
    for material in data.items():
        A2 = (format(material[1]['expires']))
print(A1 + A2)
When this is used, they just get pasted one after another. They don't get summed the way I need.
I also have tried the following code:
t1 = dt.datetime.strptime('A1', '%d-%m-%Y')
t2 = dt.datetime.strptime('Whitelist', '%d-%m-%Y')
time_zero = dt.datetime.strptime('00:00:00', '%d/%m/%Y')
print((t1 - time_zero + Whitelist).time())
However, this constantly gives out ValueError: time data does not match format '%y:%m:%d'.
What I expect is that the sum of 2019-09-11 and 0000-01-00 is 2019-10-11. However, the result is 2019-09-110000-01-00. Trying the strptime method gives ValueErrors such as: ValueError: time data does not match format '%y:%m:%d'.
Thank you in advance, and I apologize if I did something wrong on my first post.

Use pandas:
The actual format of the JSON files isn't provided, so use something like the following to get the data into a DataFrame:
pd.read_json('A1.json', orient='records') (the parameters will depend on the format of the file)
json_normalize
d2 is not a valid datetime value, so don't try to convert it.
The Code section below uses a dict to set up the DataFrame for the example.
JSON files to DataFrames:
df1 = pd.read_json('A1.json', orient='records')
df2 = pd.read_json('Whitelist.json', orient='records')
df = pd.DataFrame()
df['expires'] = df1.expires
df['d2'] = df2.expires
Code:
import pandas as pd
df = pd.DataFrame({"expires": ["2019-09-11", "2019-10-11", "2019-11-11"],
"d2": ["0000-01-00", "0000-02-00", "0000-03-00"]})
Expand d2 using str.split:
df.expires = pd.to_datetime(df.expires)
df[['y', 'm', 'd']] = df.d2.str.split('-', expand=True)
Use pd.DateOffset:
df['expires_new'] = df[['expires', 'm']].apply(lambda x: x['expires'] + pd.DateOffset(months=int(x['m'])), axis=1)
If d2 is expected to carry more than just a month value, the lambda expression can be changed to call a function that adjusts for the y, m, and d values.
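For a single pair of values, the same month arithmetic can be sketched with only the standard library. This is an illustration under assumptions rather than the approach from the answer: the helper name add_offset is hypothetical, the offset string is assumed to always have the form YYYY-MM-DD, and the day is clamped to the last day of the target month when that month is shorter.

```python
import calendar
from datetime import date, timedelta


def add_offset(expires, offset):
    """Add a 'YYYY-MM-DD' style offset (years, months, days) to an ISO date string."""
    start = date.fromisoformat(expires)
    years, months, days = (int(part) for part in offset.split("-"))

    # Count months from year 0 so that year rollover is handled by divmod.
    month_index = start.year * 12 + (start.month - 1) + years * 12 + months
    year, month = divmod(month_index, 12)
    month += 1

    # Clamp the day, e.g. 31 January plus one month becomes the last day of February.
    last_day = calendar.monthrange(year, month)[1]
    shifted = date(year, month, min(start.day, last_day))
    return shifted + timedelta(days=days)


print(add_offset("2019-09-11", "0000-01-00"))  # 2019-10-11
```

The offset is treated as three plain integers, which avoids trying to parse "0000-01-00" as a real date.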

Performing similar analysis on multiple dataframes

I am reading data from multiple dataframes.
Since the indexing and inputs are different, I need to repeat the pairing and analysis. I need dataframe specific outputs. This pushes me to copy paste and repeat the code.
Is there a fast way to refer to multiple dataframes to do the same analysis?
DF1= pd.read_csv('DF1 Price.csv')
DF2= pd.read_csv('DF2 Price.csv')
DF3= pd.read_csv('DF3 Price.csv') # These CSV's contain main prices
DF1['ParentPrice'] = FamPrices ['Price1'] # These CSV's contain second prices
DF2['ParentPrice'] = FamPrices ['Price2']
DF3['ParentPrice'] = FamPrices ['Price3']
DF1['Difference'] = DF1['ParentPrice'] - DF1['Price'] # Price difference is the output
DF2['Difference'] = DF2['ParentPrice'] - DF2['Price']
DF3['Difference'] = DF3['ParentPrice'] - DF3['Price']

You can parametrize strings with f-strings, available in Python >= 3.6. An f-string inserts the string representation of a variable's value into the string, as in:
>>> a = 3
>>> s = f"{a} is larger than 1!"
>>> print(s)
3 is larger than 1!
Your code would become:
list_of_DF = []
for symbol in ["1", "2", "3"]:
    df = pd.read_csv(f"DF{symbol} Price.csv")
    df['ParentPrice'] = FamPrices[f'Price{symbol}']
    df['Difference'] = df['ParentPrice'] - df['Price']
    list_of_DF.append(df)
Then DF1 would be list_of_DF[0] and so on.
As I mentioned, this answer is only valid if you are using Python 3.6 or later.

For the third part, I'd suggest creating something like:
DFS = [DF1, DF2, DF3]
def create_difference(dataframe):
    dataframe['Difference'] = dataframe['ParentPrice'] - dataframe['Price']
for dataframe in DFS:
    create_difference(dataframe)
For the second part, I can't think of a really convenient and short way, except maybe:
# enumerate from 1 so the counter matches the Price1, Price2, Price3 column names
for i, dataframe in enumerate(DFS, start=1):
    dataframe['ParentPrice'] = FamPrices[f'Price{i}']
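Another way to keep the per-file results apart is a dict keyed by the file's symbol. The following is a minimal sketch with made-up numbers: the in-memory frames stand in for the CSV files, FamPrices is a hypothetical stand-in for the table from the question, and add_difference is a helper name invented here.

```python
import pandas as pd

# Hypothetical stand-ins for the CSV files and the FamPrices table from the question.
frames = {
    "1": pd.DataFrame({"Price": [10.0, 12.0]}),
    "2": pd.DataFrame({"Price": [20.0, 18.0]}),
    "3": pd.DataFrame({"Price": [30.0, 33.0]}),
}
FamPrices = pd.DataFrame({
    "Price1": [11.0, 12.5],
    "Price2": [19.0, 21.0],
    "Price3": [35.0, 30.0],
})


def add_difference(df, parent_prices):
    """Return a copy of df with ParentPrice and Difference columns added."""
    out = df.copy()
    out["ParentPrice"] = parent_prices
    out["Difference"] = out["ParentPrice"] - out["Price"]
    return out


# One result per symbol, so results["1"] plays the role of DF1, and so on.
results = {
    symbol: add_difference(df, FamPrices[f"Price{symbol}"])
    for symbol, df in frames.items()
}
print(results["2"])
```

With real files, the values of frames could be built with pd.read_csv(f"DF{symbol} Price.csv") instead of the literal frames used here.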

Iterate a piece of code connecting to API using two variables pulled from two lists

I'm trying to run a script (using the Google Search Console API) over a table of keywords and dates, in order to check whether keyword performance (SEO) improved after each date.
Since I'm really clueless, I'm guessing and trying, but Jupyter Notebook isn't responding, so I can't even tell if I'm wrong...
The code comes from Josh Carty's repository:
https://github.com/joshcarty/google-searchconsole
I have already loaded the input table with pd.read_csv (it consists of two columns, 'keyword' and 'date')
and turned the columns into two separate lists (or maybe it's better to use a dictionary or something else?):
KW_list and
Date_list
I tried:
for i in KW_list and j in Date_list:
    account = searchconsole.authenticate(client_config='client_secrets.json',
                                         credentials='credentials.json')
    webproperty = account['https://www.example.com/']
    report = webproperty.query.range(j, days=-30).filter('query', i, 'contains').get()
    report2 = webproperty.query.range(j, days=30).filter('query', i, 'contains').get()
    df = pd.DataFrame(report)
    df2 = pd.DataFrame(report2)
df
I expect to see a data frame of all the different keywords (keyword1 - stats1, keyword2 - stats2 below, etc., with no overwriting) for the 30 days before the date in the neighboring cell (in the input file),
or at least some response from Jupyter Notebook so I know what is going on.

Try using the zip function to combine the lists into a sequence of tuples. This way, each keyword is paired with its corresponding date.
account = searchconsole.authenticate(client_config='client_secrets.json', credentials='credentials.json')
webproperty = account['https://www.example.com/']
frames_before = []
frames_after = []
for (keyword, date) in zip(KW_list, Date_list):
    report = webproperty.query.range(date, days=-30).filter('query', keyword, 'contains').get()
    report2 = webproperty.query.range(date, days=30).filter('query', keyword, 'contains').get()
    frames_before.append(pd.DataFrame(report))
    frames_after.append(pd.DataFrame(report2))
# DataFrame.append was removed in pandas 2.0, so combine the pieces with pd.concat
df1 = pd.concat(frames_before)
df2 = pd.concat(frames_after)
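The pairing itself can be tried out without touching the API. The following is a minimal sketch in which fetch_report is a hypothetical stand-in for the Search Console query, and the keyword and date values are invented for illustration.

```python
# Hypothetical inputs standing in for KW_list and Date_list from the question.
KW_list = ["shoes", "boots", "sandals"]
Date_list = ["2019-05-01", "2019-06-15", "2019-07-20"]


def fetch_report(keyword, date):
    """Hypothetical stand-in for the Search Console query; returns a list of rows."""
    return [{"keyword": keyword, "date": date, "clicks": len(keyword)}]


rows = []
for keyword, date in zip(KW_list, Date_list):
    # extend() adds this keyword's rows to the running list instead of replacing it
    rows.extend(fetch_report(keyword, date))

for row in rows:
    print(row)
```

Because every iteration adds to rows rather than reassigning a variable, the results for earlier keywords are not overwritten by later ones.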

Update values in a column while looping over through a pandas dataframe

I am working on a script to extract some details from images. I am trying to loop over a dataframe that has my image names. How can I add a new column to the dataframe, that populates the extracted name appropriately against the image name?
for image in df['images']:
    concatenated_name = ''.join(name)
    df.loc[image, df['images']]['names'] = concatenated_name
Expected:
Index images names
0 img_01 TonyStark
1 img_02 Thanos
2 img_03 Thor
Got:
Index images names
0 img_01 Thor
1 img_02 Thor
2 img_03 Thor

Use apply to call a function on each value in the column:
def get_name(image):
    # Code for getting the name
    return name
df['names'] = df['images'].apply(get_name)
Following your answer, which added some more details, it should be possible to shorten it to:
def get_details(filename):
    image = os.getcwd() + filename
    data = pytesseract.image_to_string(Image.open(image))
    .
    .
    .
    data = ''.join(a)
    return data
df['data'] = df['filenames'].apply(get_details)
# save df to csv / excel / other
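To see the apply pattern end to end, here is a minimal sketch that reproduces the expected table from the question. The lookup dict NAMES_BY_IMAGE is a hypothetical stand-in for the real text-extraction step, which is not shown in the thread.

```python
import pandas as pd

# Hypothetical lookup standing in for the real text-extraction step.
NAMES_BY_IMAGE = {"img_01": "TonyStark", "img_02": "Thanos", "img_03": "Thor"}


def get_name(image):
    """Return the name for one image; a stand-in for the OCR code."""
    return NAMES_BY_IMAGE.get(image, "")


df = pd.DataFrame({"images": ["img_01", "img_02", "img_03"]})

# apply calls get_name once per value and lines the results up with the rows.
df["names"] = df["images"].apply(get_name)
print(df)
```

Each row gets the value returned for its own image, so the last name processed is not repeated across the whole column.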

After multiple trials, I think I have a viable solution to this question.
I used two functions for this exercise: function 1 loops over a dataframe of files and calls function 2 to extract text, perform validation and return a value if the image has the expected field.
First, I created an empty list which is populated during each run of function 2. At the end, the user can choose to use this list to create a dataframe.
# dataframes to store data
df = pd.DataFrame(os.listdir(), columns=['filenames'])
df = df[df['filenames'].str.contains(".png|.jpg|.jpeg")]
df['filenames'] = '\\' + df['filenames']
df1 = [] # Empty list to record details
# Function 1
def extract_details(df):
    for filename in df['filenames']:
        get_details(filename)
# Function 2
def get_details(filename):
    image = os.getcwd() + filename
    data = pytesseract.image_to_string(Image.open(image))
    .
    .
    .
    data = ''.join(a)
    print(filename, data)
    df1.append([filename, data])
extract_details(df) # Run the loop so that df1 is populated
df_data = pd.DataFrame(df1, columns=['filenames', 'data']) # Container for final output
df_data.to_csv('data_list.csv') # Write output to a csv file
df_data.to_excel('data_list.xlsx') # Write output to an excel file

Pandas KeyError: value not in index

I have the following code,
df = pd.read_csv(CsvFileName)
p = df.pivot_table(index=['Hour'], columns='DOW', values='Changes', aggfunc=np.mean).round(0)
p.fillna(0, inplace=True)
p[["1Sun", "2Mon", "3Tue", "4Wed", "5Thu", "6Fri", "7Sat"]] = p[["1Sun", "2Mon", "3Tue", "4Wed", "5Thu", "6Fri", "7Sat"]].astype(int)
It always worked until the CSV file didn't cover all the days of the week. For example, with the following .csv file,
DOW,Hour,Changes
4Wed,01,237
3Tue,07,2533
1Sun,01,240
3Tue,12,4407
1Sun,09,2204
1Sun,01,240
1Sun,01,241
1Sun,01,241
3Tue,11,662
4Wed,01,4
2Mon,18,4737
1Sun,15,240
2Mon,02,4
6Fri,01,1
1Sun,01,240
2Mon,19,2300
2Mon,19,2532
I'll get the following error:
KeyError: "['5Thu' '7Sat'] not in index"
It seems to have a very easy fix, but I'm just too new to Python to know how to fix it.

Use reindex to get all the columns you need. It preserves the ones that are already there and adds the missing ones, here filled with 0 (without fill_value the new columns would contain NaN, and astype(int) would fail on them).
p = p.reindex(columns=['1Sun', '2Mon', '3Tue', '4Wed', '5Thu', '6Fri', '7Sat'], fill_value=0)
So, your entire code example should look like this:
df = pd.read_csv(CsvFileName)
p = df.pivot_table(index=['Hour'], columns='DOW', values='Changes', aggfunc=np.mean).round(0)
p.fillna(0, inplace=True)
columns = ["1Sun", "2Mon", "3Tue", "4Wed", "5Thu", "6Fri", "7Sat"]
p = p.reindex(columns=columns, fill_value=0)
p[columns] = p[columns].astype(int)
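The effect of reindex can be checked on a few rows without a file. The following is a minimal sketch, assuming a small in-memory frame in place of the CSV; the constant name DAYS and the sample rows are chosen here for illustration.

```python
import pandas as pd

DAYS = ["1Sun", "2Mon", "3Tue", "4Wed", "5Thu", "6Fri", "7Sat"]

# A tiny stand-in for the CSV in the question: no rows for 5Thu or 7Sat.
df = pd.DataFrame({
    "DOW": ["4Wed", "3Tue", "1Sun", "2Mon", "6Fri"],
    "Hour": [1, 7, 1, 18, 1],
    "Changes": [237, 2533, 240, 4737, 1],
})

p = df.pivot_table(index="Hour", columns="DOW", values="Changes", aggfunc="mean")

# reindex adds the missing weekday columns; fill_value=0 fills them with zeros,
# and fillna(0) covers the gaps that the pivot itself left in existing columns.
p = p.reindex(columns=DAYS, fill_value=0).fillna(0).round(0).astype(int)
print(p)
```

Every weekday column is present afterwards, so selecting all seven names no longer raises a KeyError.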

I had a very similar issue. I got the same error because the csv contained spaces in the header. My csv contained a header "Gender " and I had it listed as:
[['Gender']]
If it's easy enough for you to edit your csv, you can use the Excel formula TRIM() to strip any spaces from the cells,
or remove them like this:
df.columns = df.columns.to_series().apply(lambda x: x.strip())

Please try this to clean and format your column names:
df.columns = (df.columns.str.strip().str.upper()
              .str.replace(' ', '_', regex=False)
              .str.replace('(', '', regex=False)
              .str.replace(')', '', regex=False))
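The trailing-space problem is easy to reproduce in memory. The following is a minimal sketch, assuming a made-up two-row CSV whose header has a space after "Gender"; the data is invented for illustration.

```python
import io

import pandas as pd

# Hypothetical CSV whose header has a trailing space after "Gender".
raw = "Name,Gender \nAda,F\nAlan,M\n"

df = pd.read_csv(io.StringIO(raw))
print(list(df.columns))  # ['Name', 'Gender ']

# Strip surrounding whitespace from every column name before selecting columns.
df.columns = df.columns.str.strip()
print(df[["Gender"]])
```

Before the strip, df[['Gender']] would raise the same kind of KeyError as in the question, because the real column name is 'Gender ' with the space.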

I had the same issue.
During initial development I used a .csv file (comma as separator) that I modified a bit before saving it.
After saving, the commas had become semicolons.
On Windows this depends on the "Regional and Language Options" customize screen, where you find a List separator setting. This is the character Windows applications expect to be the CSV separator.
I encountered the issue when testing with a brand-new file.
I removed the 'sep' argument from the read_csv call.
Before:
df1 = pd.read_csv('myfile.csv', sep=',')
After:
df1 = pd.read_csv('myfile.csv')
That way, the issue disappeared.
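A mismatched separator is one way the expected column names can go missing. The following is a minimal sketch, assuming a file that was saved with ';' as the list separator; the in-memory content is invented for illustration and is not the file from the answer.

```python
import io

import pandas as pd

# Hypothetical file content as saved by a spreadsheet with ';' as the list separator.
raw = "DOW;Hour;Changes\n4Wed;1;237\n3Tue;7;2533\n"

# With the default separator (','), the whole header is read as one column name.
wrong = pd.read_csv(io.StringIO(raw))
print(list(wrong.columns))  # ['DOW;Hour;Changes']

# Passing the separator actually used in the file gives the expected columns.
right = pd.read_csv(io.StringIO(raw), sep=";")
print(list(right.columns))  # ['DOW', 'Hour', 'Changes']
```

Printing the column list right after reading the file is a quick way to spot this kind of mismatch before any KeyError appears.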
