Split a large CSV into multiple CSVs, one per row - python

I am using Pandas to split a large csv into multiple csv files, each containing a single row.
I have a csv with 1 million records, and using the code below it is taking too much time.
For example: in the above case there will be 1 million csv files created.
Can anyone help me decrease the time it takes to split the csv?
for index, row in lead_data.iterrows():
    row.to_csv(row['lead_id'] + ".csv")
lead_data is the dataframe object.
Thanks

You don't need to loop through the data. Filter the records by lead_id and then export the data to a CSV file. That way you will be able to split the files based on the lead ID (assuming that is what you want).
Example: split all EPL games where Arsenal was at home:
import pandas as pd

data = pd.read_csv('footbal/epl-2017-GMTStandardTime.csv')
print("Selecting Arsenal")
ft = data.loc[data['HomeTeam'] == 'Arsenal']
print(ft.head())

# Export data to CSV
ft.to_csv('arsenal.csv')
print("Done!")
This is much faster than writing one record at a time.
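If every row really must end up in its own file (one per lead_id), a groupby-based sketch along these lines may still be faster than iterrows; it assumes lead_data and its 'lead_id' column are as described in the question:

# sketch only: lead_data is assumed to be the DataFrame from the question
for lead_id, group in lead_data.groupby('lead_id'):
    # each group is written out in one call instead of building a Series per row
    group.to_csv(f"{lead_id}.csv", index=False)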

Related

Replace a row in a pandas dataframe with values from dictionary

I am trying to populate an empty dataframe by using the csv module to iterate over a large tab-delimited file, and replacing each row in the dataframe with these values. (Before you ask, yes I have tried all the normal read_csv methods, and nothing has worked because of dtype issues and how large the file is).
I first made an empty numpy array using np.empty, using the dimensions of my data. I then converted this to a pandas DataFrame. Then, I did the following:
import csv

with open(input_file) as csvfile:
    reader = csv.DictReader(csvfile, delimiter='\t')
    row_num = 0
    for row in reader:
        for key, value in row.items():
            df.loc[row_num, key] = value
        row_num += 1
This is working great, except that my file has 900,000 columns, so it is unbelievably slow. This also feels like something that pandas could do more efficiently, but I've been unable to find out how. The dictionary for each row given by DictReader looks like:
{'columnName1':<value>,'columnName2':<value> ...}
Where the values are what I want to put in the dataframe in those columns for that row.
Thanks!
So what you could do in this case is build smaller chunks of your big csv data file. I had the same issue with a 32 GB CSV file, so I had to build chunks. After reading them in, you can work with them.
# read the large csv file with specified chunksize
df_chunk = pd.read_csv(r'../input/data.csv', chunksize=1000000)
chunksize=1000000 sets how many rows are read in at once.
Helpful website:
https://towardsdatascience.com/why-and-how-to-use-pandas-with-large-data-9594dda2ea4c
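For example, a minimal sketch of writing each chunk back out as its own smaller CSV (the output file names here are just an illustration):

import pandas as pd

# read the large csv in chunks of 1,000,000 rows and write each chunk to its own file
for i, chunk in enumerate(pd.read_csv(r'../input/data.csv', chunksize=1000000)):
    chunk.to_csv(f"data_part_{i}.csv", index=False)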

Searching a large csv

I have a csv reference data file of around 1m rows. I have a csv data file of 3m rows. I need to perform a reference data lookup for each of the 3m rows into the 1m row csv file.
For various reasons I am constrained to Python and csv. I have tried holding the 1m-row table in a pandas DataFrame in memory, but the whole thing is very slow.
Can someone recommend an alternative approach?
As I mentioned above, a good solution to this type of thing would be to dump the CSV into a sqlite db and then just query it as needed :)
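A rough sketch of that idea using only the standard library (the file names, the two-column reference layout, and the first column being the lookup key are all assumptions for illustration):

import csv
import sqlite3

conn = sqlite3.connect('reference.db')
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS ref (key TEXT PRIMARY KEY, value TEXT)")

# load the 1m-row reference csv once; the primary key doubles as an index
with open('reference.csv', newline='') as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    cur.executemany("INSERT OR REPLACE INTO ref VALUES (?, ?)", reader)
conn.commit()

# look up each of the 3m data rows against the indexed table
with open('data.csv', newline='') as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    for row in reader:
        cur.execute("SELECT value FROM ref WHERE key = ?", (row[0],))
        match = cur.fetchone()  # None if there is no matching reference row

conn.close()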
Here is one idea.
import csv

# Asks for search criteria from user
search_parts = input("Enter search criteria:\n").split(",")

# Opens csv data file
file = csv.reader(open("C:\\your_path_here\\test.csv"))

# Go over each row and print it if it contains user input.
for row in file:
    if all([x in row for x in search_parts]):
        print(row)

Pyspark: split a csv file into packets

I'm very new to Spark and I'm still on my first tests with it. I installed a single node and I'm using it as my master on a decent server, running:
pyspark --master local[20]
And of course I'm facing some difficulties with my first steps using pyspark.
I have a CSV file of 40GB with around 300 million lines in it. What I want to do is find the fastest way to split this file into small packages and store them as CSV files as well. For that I have two scenarios:
First one: split the file without any criteria, just split it equally into, let's say, 100 pieces (3 million rows each).
Second one: the CSV data I'm loading is tabular and I have one column X with 100K different IDs. What I would like to do is create a set of dictionaries and create smaller CSV files, where my dictionaries tell me to which package each row should go.
So far, this is where I'm now:
sc=SparkContext.getOrCreate()
file_1 = r'D:\PATH\TOFILE\data.csv'
sdf = spark.read.option("header","true").csv(file_1, sep=";", encoding='cp1252')
Thanks for your help!
The best (and probably "fastest") way to do this would be to take advantage of Spark's in-built partitioning of RDDs and write one CSV file from each partition. You may repartition or coalesce to create the desired number of partitions (let's say, 100). This will give you maximum parallelism (based on your cluster resources and configuration), as each Spark executor works on one partition at a time.
You may do one of these:
Do a mapPartitions over the DataFrame and write each partition to a unique CSV file.
OR use df.write.partitionBy("X").csv('mycsv.csv'), which will create one partition (and thereby one file) per unique entry in "X".
Note: if you use HDFS to store your CSV files, Spark will automatically create multiple files to store the different partitions (number of files created = number of RDD partitions).
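For instance, a minimal sketch of the first option, assuming sdf is the DataFrame loaded in the question and that roughly 100 output files are wanted:

# repartition into ~100 roughly equal partitions; Spark writes one part-file per partition
sdf.repartition(100).write.csv('output_dir', sep=';', header=True, mode='overwrite')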
What I did in the end was load the data as a Spark dataframe; Spark automatically creates equal-sized partitions of 128 MB (the default Hive configuration), and then I used the repartition method to redistribute my rows according to the values of a specific column in my dataframe.
# This will load my CSV data into a spark dataframe and will generate the required number of 128MB partitions to store my raw data.
sdf = spark.read.option('header','true').csv(file_1, sep=';', encoding='utf-8')
# This line will redistribute the rows of each partition according to the values in a specific column. Here I'm placing all rows with the same set of values in the same partition, and I'm creating 20 of them. (Spark handles allocating the rows, so the partitions will be roughly the same size.)
sdf_2 = sdf.repartition(20, 'TARGET_COLUMN')
# This line will save all my 20 partitions on different csv files
sdf_2.write.saveAsTable('CSVBuckets', format='csv', sep=';', mode='overwrite', path=output_path, header='True')
The easiest way to split a csv file is to use the Unix utility called split.
Just google "split unix command line".
I split my files using split -l 3500 XBTUSDorderbooks4.csv orderbooks

Pulling data from two excel files with openpyxl in python

I am trying to pull data from two Excel files using openpyxl. One file includes two columns, employee names and hours worked; the other, two columns, employee names and hourly wage. Ultimately, I'd like the files compared by name, wage * hours worked calculated, and the result dumped into a third sheet by name and wages payable, but at this point I'm struggling to get the items from the two columns in the first sheet into Python so I can manipulate them.
I thought I'd create two lists from the columns, then combine them into a dictionary, but I don't think that will get me where I need to be.
Any suggestions on how to get this data into python to manipulate it would be fantastic!
import openpyxl

wb = openpyxl.load_workbook("Test_book.xlsx")
sheet = wb.get_sheet_by_name('Hours')

employee_names = []
employee_hours = []

for row in sheet['A']:
    employee_names.append(row.value)
for row in sheet['B']:
    employee_hours.append(row.value)

my_dict = dict(zip(employee_names, employee_hours))
print(my_dict)
A dict comprehension may do it, using zip to iterate over the two columns in parallel:
my_dict = {name.value: hours.value for name, hours in zip(sheet['A'], sheet['B'])}
zip pairs up the two columns so you walk them in parallel.
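For the larger goal (wage * hours worked dumped into a third sheet by name), a sketch along these lines may help; the sheet names 'Hours', 'Wages' and 'Payroll' and the single-workbook layout are assumptions for illustration:

import openpyxl

wb = openpyxl.load_workbook("Test_book.xlsx")
hours_sheet = wb['Hours']
wages_sheet = wb['Wages']

# build name -> value dictionaries from the two two-column sheets
hours = {name.value: hrs.value for name, hrs in zip(hours_sheet['A'], hours_sheet['B'])}
wages = {name.value: wage.value for name, wage in zip(wages_sheet['A'], wages_sheet['B'])}

# write name and wages payable to a third sheet, matching on name
payroll = wb.create_sheet('Payroll')
for name, hrs in hours.items():
    if name in wages and hrs is not None and wages[name] is not None:
        payroll.append([name, hrs * wages[name]])

wb.save("Test_book.xlsx")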

Reading from a specific row/column of an Excel CSV file

I am a beginner at Python and I'm looking to take 3 specific columns starting at a certain row from a .csv spreadsheet and then import each into python.
For example, I would need to take 1000 rows' worth of data from column F, starting at row 12.
I've looked at options using csv and pandas, but I can't figure out how to have them start importing at a certain row/column.
Any help would be greatly appreciated.
If the spreadsheet is not huge, the easiest approach is to load the entire CSV file into Python using the csv module and then extract the required rows and columns. For example:
import csv

with open('Book1.csv', newline='') as f:
    rows = list(csv.reader(f))

data = [row[5] for row in rows[11:11+1000]]
That will do the trick. Remember that Python starts numbering from 0, so row[5] is column F from your spreadsheet and rows[11] is row 12.
CSV files being text files, there is no way to jump straight to a certain line; you will have to read line by line and count. Have a look at the csv module in Python, which will explain how to (easily) read lines. Particularly this section.
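Since the question also mentions pandas, here is a short sketch using read_csv's skiprows/nrows/usecols arguments; the file name and the choice of columns F-H are assumptions for illustration:

import pandas as pd

# skip the first 11 rows so reading starts at row 12, read 1000 rows,
# and keep only columns F, G and H (0-based indices 5, 6, 7)
data = pd.read_csv('Book1.csv', header=None, skiprows=11, nrows=1000, usecols=[5, 6, 7])
print(data.head())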
