Creating multiple CSV files from an existing CSV file with Python pandas

I'm trying to take a large csv file and write a separate csv file for each unique combination of two columns. I was able to extract the unique values of those two columns from the file, so I know which csv files need to be created.
Ex Data:
1,224939.203,1243008.651,1326.774,F,C-GRAD-FILL,09/22/18 07:24:34,
1,225994.242,1243021.426,1301.772,BS,C-GRAD-FILL,09/24/18 08:24:18,
451,225530.332,1243016.186,1316.173,GRD,C-TOE,10/02/18 11:49:13,
452,225522.429,1242996.017,1319.168,GRD,C-TOE KEY,10/02/18 11:49:46,
I would like to create a csv file "C-GRAD-FILL 09-22-18.csv" with all of the rows that match those two values, but I cannot decide how to iterate through the data for both values.
import pandas as pd

def readData(fileName):
    df = pd.read_csv(fileName, index_col=False,
                     names=['Number', 'Northing', 'Easting', 'Elevation',
                            'Description', 'Layer', 'Date'],
                     parse_dates=['Date'])
    ## Layers here!!!
    layers = df['Layer'].unique()
    ## Dates here!!! AS DATETIME OBJECTS!!!!
    dates = df['Date'].map(lambda t: t.date()).unique()
    ## Sorted in order
    sortedList = df.sort_values(by=['Layer', 'Date'])

You can use a GroupBy object. First ensure your date is in the correct string format:
df['Date'] = pd.to_datetime(df['Date']).dt.strftime('%m-%d-%y')
To output all files, iterate a GroupBy object:
for (layer, date), group in df.groupby(['Layer', 'Date']):
    group.to_csv(f'{layer} {date}.csv', index=False)
Or, for one specific combination:
layer = 'C-GRAD-FILL'
date = '09-22-18'
g = df.groupby(['Layer', 'Date'])
g.get_group((layer, date)).to_csv(f'{layer} {date}.csv', index=False)
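Putting the pieces together, here is a minimal end-to-end sketch, assuming the sample rows above are saved in a file named points.csv (a hypothetical name):
import pandas as pd

# hypothetical input file containing the sample rows shown above
df = pd.read_csv('points.csv', index_col=False,
                 names=['Number', 'Northing', 'Easting', 'Elevation',
                        'Description', 'Layer', 'Date'],
                 parse_dates=['Date'])

# format the timestamp as a date string usable in a filename
df['Date'] = df['Date'].dt.strftime('%m-%d-%y')

# write one file per (Layer, Date) pair, e.g. "C-GRAD-FILL 09-22-18.csv"
for (layer, date), group in df.groupby(['Layer', 'Date']):
    group.to_csv(f'{layer} {date}.csv', index=False)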

Related

How can I read multiple CSV files and merge them in single dataframe in PySpark

I have 4 CSV files with different columns; some of them share column names. The details of the csv files are:
capstone_customers.csv: [customer_id, customer_type, repeat_customer]
capstone_invoices.csv: [invoice_id, product_id, customer_id, days_until_shipped, product_line, total]
capstone_recent_customers.csv: [customer_id, customer_type]
capstone_recent_invoices.csv: [invoice_id, product_id, customer_id, days_until_shipped, product_line, total]
My code is:
df1 = spark.read.options(inferSchema='True',header='True',delimiter=',').csv("capstone_customers.csv")
df2 = spark.read.options(inferSchema='True',header='True',delimiter=',').csv("capstone_invoices.csv")
df3 = spark.read.options(inferSchema='True',header='True',delimiter=',').csv("capstone_recent_customers.csv")
df4 = spark.read.options(inferSchema='True',header='True',delimiter=',').csv("capstone_recent_invoices.csv")
from functools import reduce

def unite_dfs(df1, df2):
    return df2.union(df1)

list_of_dfs = [df1, df2, df3, df4]
united_df = reduce(unite_dfs, list_of_dfs)
but I got the error:
Union can only be performed on tables with the same number of columns, but the first table has 6 columns and the second table has 3 columns;;
'Union
:- Relation[invoice_id#234,product_id#235,customer_id#236,days_until_shipped#237,product_line#238,total#239] csv
+- Relation[customer_id#218,customer_type#219,repeat_customer#220] csv
How can I merge them into a single data frame and remove the duplicate column names using PySpark?
To read multiple files in Spark you can make a list of all the files you want and read them at once; you don't have to read them in order.
Here is an example of code you can use:
path = ['file1.csv', 'file2.csv']
df = spark.read.options(header=True).csv(path)
df.show()
You can provide a list of files or a path to the files instead of reading them one by one. And don't forget about the mergeSchema option:
files = [
    "capstone_customers.csv",
    "capstone_invoices.csv",
    "capstone_recent_customers.csv",
    "capstone_recent_invoices.csv"
]
df = spark.read.options(inferSchema='True', header='True', delimiter=',', mergeSchema='True').csv(files)
# or
df = spark.read.options(inferSchema='True', header='True', delimiter=',', mergeSchema='True').csv('/path/to/files/')
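A plain union also requires the same number of columns, which is exactly the error above. If the goal is to stack frames whose columns differ, one option is unionByName with allowMissingColumns (available in Spark 3.1+); missing columns are filled with nulls. A sketch, reusing the four dataframes from the question:
from functools import reduce

# align columns by name and pad missing ones with nulls (Spark 3.1+)
def unite_dfs(left, right):
    return left.unionByName(right, allowMissingColumns=True)

united_df = reduce(unite_dfs, [df1, df2, df3, df4])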

How to extract a specific value from multiple csv of a directory, and append them in a dataframe?

I have a directory with hundreds of csv files that represent the pixels of a thermal camera (288x383). I want to get the center value of each file (e.g. row 144, column 191) and append each of those values to a dataframe that lists the names of the files.
Here is my code so far, where I create the dataframe with the list of csv file names:
import os
import glob
import numpy as np
import pandas as pd
os.chdir("/Programming/Proj1/Code/Image_Data")
!ls
Out:
2021-09-13_13-42-16.csv
2021-09-13_13-42-22.csv
2021-09-13_13-42-29.csv
2021-09-13_13-42-35.csv
2021-09-13_13-42-47.csv
2021-09-13_13-42-53.csv
...
file_extension = '.csv'
all_filenames = [i for i in glob.glob(f"*{file_extension}")]
files = glob.glob('*.csv')
all_df = pd.DataFrame(all_filenames, columns=['Full_name'])
all_df.head()
Full_name
0 2021-09-13_13-42-16.csv
1 2021-09-13_13-42-22.csv
2 2021-09-13_13-42-29.csv
3 2021-09-13_13-42-35.csv
4 2021-09-13_13-42-47.csv
5 2021-09-13_13-42-53.csv
6 2021-09-13_13-43-00.csv
You can loop through your files one by one, reading each in as a dataframe and taking the center value that you want, then save this value along with the file name. The list of results can then be read into a new dataframe ready for you to use.
result = []
for file in files:
    # read in the file; you may need to specify some extra parameters,
    # check the pandas docs for read_csv
    df = pd.read_csv(file)
    # now select the value you want;
    # this will vary depending on what your indexes look like (if any)
    # and also your column names
    value = df.loc[row, col]
    # append the file name and value to the list
    result.append((file, value))
# you should now have a list in the format:
# [('2021-09-13_13-42-16.csv', 100), ('2021-09-13_13-42-22.csv', 255), ...
# load the list of tuples as a dataframe for further processing or analysis...
result_df = pd.DataFrame(result)
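For the 288x383 frames described in the question, a concrete version might select the center cell positionally. This is a sketch that assumes the files are raw pixel grids with no header row; the indices follow the question's example:
import glob
import pandas as pd

result = []
for file in glob.glob('*.csv'):
    # header=None because the files are assumed to contain only pixel values
    df = pd.read_csv(file, header=None)
    # center pixel per the question's example (row 144, column 191)
    value = df.iloc[144, 191]
    result.append((file, value))

result_df = pd.DataFrame(result, columns=['Full_name', 'center_value'])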

How to create a multiIndex dataframe from a streaming csv file

I'm streaming data to a csv file.
This is the request:
symbols=["SPY", "IVV", "SDS", "SH", "SPXL", "SPXS", "SPXU", "SSO", "UPRO", "VOO"]
each symbol has a list with a range of (0,8)
this is what it looks like in 3 columns:
-1583353249601,symbol,SH
-1583353249601,delayed,False
-1583353249601,asset-main-type,EQUITY
-1583353250614,symbol,SH
-1583353250614,last-price,24.7952
-1583353250614,bid-size,362
-1583353250614,symbol,VOO
-1583353250614,bid-price,284.79
-1583353250614,bid-size,3
-1583353250614,ask-size,1
-1583353250614,bid-id,N
my end goal is to reshape the data:
this is what I need to achieve.
The problems that I encountered were:
not being able to group by timestamp, and not being able to pivot.
1) I tried to create a dict so it could later be passed to pandas, but I'm missing data in the process.
I need to find a way to group the data that has the same timestamp; it looks like it omits the lines with the same timestamp.
code:
import csv
import pandas as pd

new_data_dict = {}
with open("stream_data.csv", 'r') as data_file:
    data = csv.DictReader(data_file, delimiter=",")
    for row in data:
        item = new_data_dict.get(row["timestamp"], dict())
        item[row["symbol"]] = row["value"]
        new_data_dict[row['timestamp']] = item
data = new_data_dict
data = pd.DataFrame.from_dict(data)
print(data.T)
2) This is another approach: I was able to group by timestamp by creating 2 different dataframes, but I cannot split the value column into multiple columns to merge later on matching indexes.
code:
data = pd.read_csv("tasty_hola.csv",sep=',' )
data1 = data.groupby(['timestamp']).apply(lambda v: v['value'].unique())
data = data.groupby(['timestamp']).apply(lambda v: v['symbol'].unique())
data1 = pd.DataFrame({'timestamp':data1.index, 'value':data1.values})
At this moment I don't know if the logic I'm trying to apply is the correct one. I'm very lost, not able to see the light at the end of the tunnel.
Thank you very much
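A minimal sketch of one way to reshape such data with a pivot, assuming the three columns are timestamp, field, and value (hypothetical names, since the sample rows have no header):
import pandas as pd

# hypothetical column names for the three-column stream shown above
df = pd.read_csv('stream_data.csv', header=None,
                 names=['timestamp', 'field', 'value'])

# one row per timestamp, one column per field; 'first' keeps the first
# value when a timestamp/field pair repeats
wide = df.pivot_table(index='timestamp', columns='field',
                      values='value', aggfunc='first')
print(wide)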

Extract date from file name and create a new column with that date auto populated in Python

I am concatenating 100 CSVs with names like XXX_XX_20112020.csv to create one file, let's say master.csv.
Can I extract the date from each file name and create a new column with that date auto-populated for all records in that file? Should I be doing this before or after concatenation, and how?
If they all follow the same XXX_XX_20112020.csv pattern, then just do 'XXX_XX_20112020.csv'.rsplit('_', 1)[-1].rsplit('.', 1)[0]:
import datetime
file_name = 'XXX_XX_20112020.csv'
file_name_ending = file_name.rsplit('_',1)[-1]
date_part = file_name_ending.rsplit('.',1)[0]
date_part_parsed = datetime.datetime.strptime(date_part, "%d%m%Y").date()
So rsplit splits the file name on '_', and we do the same on '.' to get rid of the '.csv' suffix. Now you need to turn the date string into a real date.
Read here:
https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior
https://docs.python.org/3/library/datetime.html#datetime.datetime.strptime
strptime will turn the string into a datetime object when the right format is given.
Now you can make this a function and apply it to all the file names you have.
P.S: rsplit https://docs.python.org/3/library/stdtypes.html#str.rsplit
import datetime
import os
import pandas as pd

master_df = pd.DataFrame()
for file in os.listdir('folder_with_csvs'):
    # we access the last element after an underscore and all before the dot before csv
    date_for_file = file.split('_')[-1].split('.')[0]
    date_for_file = datetime.datetime.strptime(date_for_file, "%d%m%Y").date()
    # os.listdir returns bare file names, so join them with the folder path
    df = pd.read_csv(os.path.join('folder_with_csvs', file))
    # the following line puts your date in the `POST_DATE` column for every record of this file
    df['POST_DATE'] = date_for_file
    master_df = pd.concat([master_df, df])
# Eventually
master_df.to_csv('master.csv')

Manipulating the values of each file in a folder using a dictionary and loop

How do I go about manipulating each file in a folder based on values pulled from a dictionary? Basically, say I have x files in a folder. I use pandas to reformat the dataframe, add a column which includes the date of the report, and save the new file under the same name plus the date.
import pandas as pd
from pathlib import Path
import os

source = Path("Users/Yay/AlotofFiles/April")
items = os.listdir(source)
d_dates = {'0401': '04/1/2019', '0402': '4/2/2019', '0403': '04/03/2019'}
for item in items:
    for key, value in d_dates.items():
        df = pd.read_excel(source / item, header=None)
        df.columns = ['A', 'B', 'C']
        df = df[df['A'].str.contains("Awesome")]
        df['Date'] = value
        file_basic = "retrofile"
        short_date = key
        xlsx = ".xlsx"
        file_name = file_basic + short_date + xlsx
        df.to_excel(file_name)
I want each file to be unique and categorized by the date. In this case, I would want to have three files, for example "retrofile0401.xlsx" that has a column that contains "04/01/2019" and only has data relevant to the original file.
What actually happens is that the loop creates three different files for each individual item, moves on to the next file, and repeats, replacing the previous iteration until I am left with three files that are copies of the last file. Each file has a different date and name, which is what I want, but the data is duplicated from the last file.
If I remove the second loop, it works the way I want it but there's no way of categorizing it based on the value I made in the dictionary.
Try the following. I'm only making input filenames explicit to make clear what's going on. You can continue to use yours from the source.
input_filenames = [
    'retrofile0401_raw.xlsx',
    'retrofile0402_raw.xlsx',
    'retrofile0403_raw.xlsx',
]
date_dict = {
    '0401': '04/1/2019',
    '0402': '4/2/2019',
    '0403': '04/03/2019'}
for filename in input_filenames:
    date_key = filename[9:13]
    df = pd.read_excel(filename, header=None)
    # header=None leaves integer column labels, so name the columns first
    df.columns = ['A', 'B', 'C']
    df = df[df['A'].str.contains("Awesome")]
    df['Date'] = date_dict[date_key]
    df.to_excel('retrofile{date_key}.xlsx'.format(date_key=date_key))
filename[9:13] takes the four characters at zero-based positions 9-12 of the filename; those are the ones that correspond to your date codes.
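If the filenames do not all share the same length, extracting the code by pattern is more robust than fixed positions. A sketch using a hypothetical regex for a four-digit date code:
import re

def extract_date_key(filename):
    # look for a four-digit code such as '0401' anywhere in the name
    match = re.search(r'\d{4}', filename)
    return match.group(0) if match else None

print(extract_date_key('retrofile0401_raw.xlsx'))  # '0401'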
