Grouping dictionaries using Pandas DataFrame from two separate lists - python

I have two pickle files, each of which is a dictionary of satellite images with their date, PM value, and location of sensor. They have the following setup:
{'Image': array([[[...]]], dtype=uint8),
 'Date': '2018-01-01',
 'PM': 100,
 'Location': 'Los Angeles'}
This is ONE entry in the pickle, i.e., if my pickle is called data, then data[10] would be the entry for Jan 09, 2018, such as:
{'Image': array([[[...]]], dtype=uint8),
 'Date': '2018-01-09',
 'PM': 19,
 'Location': 'Los Angeles'}
I want to combine two pickle files, with lengths 1079 and 1023 respectively, into ONE pickle file grouped by the location. So after pairing, when I call data[0], I get the 1079 inputs for the first sensor station, and data[1] will be the 1023 inputs for the second sensor station. The second sensor station is located in San Diego, and has the exact same format as the first station in terms of its dictionary.
Here's what I have:
I read in both pickle files using the following code (I ran this code twice with different pickle files, so now I have two lists):
import pickle as pkl

LA_data = []
with open('/work/srs108/LA/data.pkl', 'rb') as f:
    while True:
        try:
            LA_data.append(pkl.load(f))
        except EOFError:
            break

SD_data = []
with open('/work/srs108/SD/data.pkl', 'rb') as f:
    while True:
        try:
            SD_data.append(pkl.load(f))
        except EOFError:
            break
But now I have two lists, and I'm struggling with the following line to convert them into a Pandas DataFrame so that I can use the groupby function. I don't know what dimensions to give reshape, because my data are lists of dictionaries.
df = pd.DataFrame(np.array(data).reshape(??,??), columns = ['Image', 'Date', 'PM', 'Location'])
Any tips on how to get my dictionaries into a dataframe in order to group them together? Is this the right way I should go about this?

Could you clarify what the variable "data" refers to? If it is something like "city_list" in this small rebuild, this might work.
import numpy as np
import pandas as pd

arr1 = np.array([[[1], [2]], [[3], [4]]])
dict1 = {'Image': arr1,
         'Date': '2018-01-01',
         'PM': 100,
         'Location': 'Los Angeles'}
LA_data = [dict1]

arr2 = np.array([[[11], [12]], [[13], [14]]])
dict2 = {'Image': arr2,
         'Date': '2018-01-09',
         'PM': 19,
         'Location': 'San Diego'}
SD_data = [dict2]

city_list = LA_data + SD_data
df = pd.DataFrame(city_list)
print(df.head())
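If the goal is to end up with one object per sensor station (data[0] for Los Angeles, data[1] for San Diego), a minimal sketch building on the df above could then group on 'Location':

# Group the combined frame by sensor location; each group holds one station's rows.
groups = [group for _, group in df.groupby('Location')]

# Optionally turn each group back into a list of dicts, mirroring the pickle entries.
per_station = [g.to_dict('records') for g in groups]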

Related

PySpark: Create a subset of a dataframe for all dates

I have a DataFrame that has a lot of columns and I need to create a subset of that DataFrame that has only date values.
For example, my DataFrame could be:
1, 'John Smith', '12/10/1982', '123 Main St', '01/01/2000'
2, 'Jane Smith', '11/21/1999', 'Abc St', '12/12/2020'
And my new DataFrame should only have:
'12/10/1982', '01/01/2000'
'11/21/1999', '12/12/2020'
The dates could be in any format and could be in any column. I can use dateutil.parser to parse them and make sure they are dates, but I'm not sure how to easily call parse() on all the columns and filter only those that return true into another DataFrame.
If you know which columns the datetimes are in, it's easy:
df2 = df[["column_name_1", "column_name_2"]]
# or
df2 = df.iloc[:, [2, 4]]
You can find your columns' datatype by checking each tuple in your_dataframe.dtypes.
from datetime import datetime

schema = "id int, name string, date timestamp, date2 timestamp"
df = spark.createDataFrame([(1, "John", datetime.now(), datetime.today())], schema)

list_of_columns = []
for (field_name, data_type) in df.dtypes:
    if data_type == "timestamp":
        list_of_columns.append(field_name)
Now you can use this list inside .select()
df_subset_only_timestamps = df.select(list_of_columns)
EDIT: I realized your date columns might be StringType.
You could try something like:
from pyspark.sql.functions import col, when

df_subset_only_timestamps = df.select(
    [when(col(column).like("%/%/%"), col(column)).alias(column) for column in df.columns]
).na.drop()
Inspired by this answer. Let me know if it works!
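Since the question also mentions dateutil.parser, another option (a rough sketch, not part of the answer above, and only a heuristic) is to collect a small sample of rows on the driver and keep the columns whose values all parse as dates:

from dateutil import parser

def looks_like_date(value):
    # True if the string parses as a date; note that bare numbers such as IDs also parse,
    # so this is only a heuristic.
    try:
        parser.parse(str(value))
        return True
    except (ValueError, OverflowError):
        return False

# Sample a few rows to the driver and test each column's values.
sample_rows = df.limit(20).collect()
date_columns = [c for c in df.columns
                if all(looks_like_date(row[c]) for row in sample_rows if row[c] is not None)]

df_dates_only = df.select(date_columns)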

How can I change the .csv file column to something compatible?

I am trying to get a list of directors and calculate their average scores based on all the movies in this .csv file. I have written some sample code so it is easier to understand. The sample code works fine, but when I use the columns from the .csv file it gives me this error: '<' not supported between instances of 'str' and 'float'. Here is the sample code:
df = pd.DataFrame(data={"Director":[ 'Christopher Nolan', 'David Fincher', 'Christopher Nolan', 'Quentin Tarantino', 'Quentin Tarantino', 'Christopher Nolan' ],
"Score": [ 8.9, 9.0, 8.8, 7.8, 9.2, 7.9]})
director_list = []
avg_scores = []
for director in np.unique(df["Director"]):
director_list.append(director)
avg_scores.append(df.loc[df["Director"]==director, "Score"].mean())
df = pd.DataFrame(data={"Director":director_list, "Score": avg_scores})
df
If anyone could help I would greatly appreciate it :)
This is the code in my main file that is causing the error.
data = pd.read_csv('movies.csv')  # read in file
dataDirector = data
dataDirector.dropna(subset=['Director', 'Score'])  # create data set for year score graph
dataDirector.sort_values(by=['Score'], inplace=True)  # order the scores
dataDirector.reset_index()

df4 = pd.DataFrame(data={"Director": dataDirector['Director'], "Score": dataDirector['Score']})
director_list4 = []
avg_scores4 = []
for director in np.unique(df4["Director"]):
    director_list4.append(director)
    avg_scores4.append(df4.loc[df4["Director"] == director, "Score"].mean())

df4 = pd.DataFrame(data={"Director": director_list4, "Score": avg_scores4})
df4
Is it right that you are trying to do something like:
if score < x:  # do something
Please check whether your x is also a float or integer. As the error says, you are probably using a string like "6" instead of an integer or float like 6.
Update:
This statement raises the error:
np.unique(df4["Director"])
np.unique sorts its input, and the column most likely still contains NaN values (floats, since the dropna result was never assigned back) mixed with strings, which is where the '<' comparison fails. Try something like
df4["Director"].unique()

Create different files based off value in dataframe column A and save to different existing folders based off value in dataframe column A

First, I would like to create different files based on the value in dataframe column A, 'FTP_FOLDER_PATH'.
Second, I would like to save these files to different folders depending on the value in dataframe column A, 'FTP_FOLDER_PATH'. These folders already exist and do not need to be created.
I am struggling with how to do this through looping. I have done something similar in the past for the first part, where I just create different files, but I could only figure out how to save them to one folder. I am stuck on trying to save them to multiple folders. In the code, I have included:
the dataframe
what I have attempted which only solves the first part of the problem and
the desired output which all needs to go to the correct FTP folders.
import pandas as pd
import os

FTP_Master_Folder = 'C:/FTP'

df = pd.DataFrame({'FTP_FOLDER_PATH': ['C:\FTP1', 'C:\FTP2', 'C:\FTP2', 'C:\FTP2', 'C:\FTP3', 'C:\FTP3'],
                   'NAME': ['Jon', 'Kat', 'Kat', 'Kat', 'Joe', 'Joe'],
                   'CARS': ['Honda', 'Lexus', 'Porsche', 'Saleen s7', 'Tesla', 'Tesla']})
df

for i, x in df.groupby('FTP_FOLDER_PATH'):
    # How do I change the line below to loop through and change the directory
    # based on the value of 'FTP_FOLDER_PATH'?
    os.chdir(f'{FTP_Master_Folder}')
    p = os.path.join(os.getcwd(), i + '.csv')
    x.to_csv(p, index=False)
# Desired output to specific FTP folder based on row of dataframe
df_FTP1 = pd.DataFrame({'FTP_FOLDER_PATH': ['C:\FTP1'],
                        'NAME': ['Jon'],
                        'CARS': ['Honda']})
df_FTP1

df_FTP2 = pd.DataFrame({'FTP_FOLDER_PATH': ['C:\FTP2', 'C:\FTP2', 'C:\FTP2'],
                        'NAME': ['Kat', 'Kat', 'Kat'],
                        'CARS': ['Lexus', 'Porsche', 'Saleen s7']})
df_FTP2

df_FTP3 = pd.DataFrame({'FTP_FOLDER_PATH': ['C:\FTP3', 'C:\FTP3'],
                        'NAME': ['Joe', 'Joe'],
                        'CARS': ['Tesla', 'Tesla']})
df_FTP3
I discovered a minor basic error: I should have included /{i} in line 2 of my loop (the os.chdir line). i is the subfolder of the master folder in this case, so adding this in allows the files to go to their destinations, which solves part two of my problem quite easily.
for i, x in df_joined.groupby('FTP_FOLDER_PATH'):
    os.chdir(f'{FTP_Master_Folder}/{i}')
    p = os.path.join(os.getcwd(), i + '.csv')
    x.to_csv(p, index=False)
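A variant that skips os.chdir entirely (a sketch, assuming each FTP_FOLDER_PATH value is itself an existing target folder) builds the full output path directly:

import os

for folder, group in df.groupby('FTP_FOLDER_PATH'):
    # Write each group straight into its folder, using the folder's base name as the file name.
    out_path = os.path.join(folder, os.path.basename(folder) + '.csv')
    group.to_csv(out_path, index=False)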

Replace values from pandas dataset with dictionary

I am extracting a column from an Excel document with pandas. After that, for each row of the selected column, I want to replace all keys contained in multiple dictionaries grouped in a list.
import pandas as pd
file_loc = "excelFile.xlsx"
df = pd.read_excel(file_loc, usecols = "C")
In this case, the column I work with is df['Q10'], and this data frame has more than 10k rows.
Traditionally, if I want to replace a value in df I use:
df['Q10'].str.replace('val1', 'val2')
Now, I have a dictionary of words like:
mydic = [
    {
        'key': "wasn't",
        'value': 'was not'
    },
    {
        'key': "I'm",
        'value': 'I am'
    },
    # ... + tons of lines of key value pairs
]
Currently, I have created a function that iterates over "mydic" and replaces all occurrences one by one.
def replaceContractions(df, mydic):
    for cont in contractions:
        df.str.replace(cont['key'], cont['value'])
Next I call this function passing mydic and my dataframe:
replaceContractions(df['Q10'], contractions)
First problem: this is very expensive, because mydic has a lot of items and the data set is iterated over for each item in it.
Second: it seems that it doesn't work :(
Any Ideas?
Convert your "dictionary" to a more friendly format:
m = {d['key'] : d['value'] for d in mydic}
m
{"I'm": 'I am', "wasn't": 'was not'}
Next, call replace with the regex switch and pass m to it.
df['Q10'] = df['Q10'].replace(m, regex=True)
replace accepts a dictionary of key-replacement pairs, and it should be much faster than iterating over each key-replacement pair one at a time.
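A small end-to-end check (a sketch with made-up rows, not the asker's data) of the same approach:

import pandas as pd

mydic = [{'key': "wasn't", 'value': 'was not'},
         {'key': "I'm", 'value': 'I am'}]

df = pd.DataFrame({'Q10': ["I'm late", "it wasn't me", "no contractions here"]})

# Build the flat mapping once and apply all replacements in a single vectorised call.
m = {d['key']: d['value'] for d in mydic}
df['Q10'] = df['Q10'].replace(m, regex=True)
print(df['Q10'].tolist())
# ['I am late', 'it was not me', 'no contractions here']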

python, Storing and Reading varying dictionary size information in a csv file

I have implemented a python dictionary which holds SQL queries & results.
import time

logtime = time.strftime("%d.%m.%Y")

sqlDict = {'time': logtime,
           'Q1': 50,
           'Q2': 15,
           'Q3': 20,
           'Q4': 10,
           'Q5': 30,
           }
Each day, the results are written to a CSV file in dictionary format. Note: Python dictionaries are not ordered, so the columns in each row may vary when additional queries (e.g. Q7, Q8, Q9, ...) are added to the dictionary.
('Q1', 25);('Q3', 23);('Q2', 15);('Q5', 320);('Q4', 130);('time', '20.03.2016')
('Q1', 35);('Q2', 21);('Q3', 12);('Q5', 30);('Q4', 10);('time', '21.03.2016')
('Q4', 22);('Q3', 27);('Q2', 15);('Q5', 30);('Q1', 10);('time', '22.03.2016')
With the addition of a new SQL query to the dictionary, the additional information is also saved in the same csv file. So, e.g., with the addition of Q7, the dictionary looks like
sqlDict = {'time': logtime,
           'Q1': 50,
           'Q2': 15,
           'Q3': 20,
           'Q4': 10,
           'Q5': 30,
           'Q7': 5,
           }
and the csv file will look like
('Q1', 25);('Q3', 23);('Q2', 15);('Q5', 320);('Q4', 130);('time', '20.03.2016')
('Q1', 35);('Q2', 21);('Q3', 12);('Q5', 30);('Q4', 10);('time', '21.03.2016')
('Q4', 22);('Q3', 27);('Q2', 15);('Q5', 30);('Q1', 10);('time', '22.03.2016')
('Q1', 50);('Q3', 20);('Q2', 15);('Q5', 30);('Q4', 10);('time', '23.03.2016');('Q7', 5)
I need to plot all the information available in the csv, i.e. for all SQL keys, a time vs. value plot.
The csv file does not hold a regular pattern. In the end, I would like to plot a graph with all available Qs and their corresponding values. Where Qs are missing in a row, the program should assume the value 0 for that date.
You just need to process your csv. You know that every cell is a (query, value) pair, with the date in the 'time' cell, so the file is reasonably well formatted.
import csv

with open("file.csv", "r") as f:
    spamreader = csv.reader(f, delimiter=";")
    for row in spamreader:
        for cell in row:
            query, result = cell.strip('(').strip(')').split(",")
            if query != "time":
                # process it
                # query = 'QX'
                # result = 'N'
                pass
            else:
                # query = 'time'
                # result = 'date'
                pass
The thing that will bother you is that you will read everything as a string, so you will have to split on the comma and strip the '(' and the ')'.
for example:
query,result = row[x].strip('(').strip(')').split(", ")
then query = 'Q2' and result = 15 (type = string for both)
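To get from the parsed cells to the plot the question asks for, a rough sketch (assuming the ';'-separated file shown above, and filling missing Qs with 0) could collect each row into a dict and let pandas align the columns:

import csv
import pandas as pd
import matplotlib.pyplot as plt

rows = []
with open("file.csv", "r") as f:
    for row in csv.reader(f, delimiter=";"):
        parsed = {}
        for cell in row:
            key, value = cell.strip("()").split(", ")
            parsed[key.strip("'")] = value.strip("'")
        rows.append(parsed)

# pandas aligns the differing keys per row; missing Qs become NaN and are filled with 0.
df = pd.DataFrame(rows)
df['time'] = pd.to_datetime(df['time'], format="%d.%m.%Y")
df = df.set_index('time').astype(float).fillna(0)

df.plot()  # one line per Q column over time
plt.show()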
