I'm new to the Pandas library.
I have shared code that works off of a dataframe.
Is there a way to read a gzip file line by line without any delimiter (using the full line, which can include commas and other characters) so that each line becomes a single row in the dataframe? It seems you have to provide a delimiter, and when I provide "\n" it is able to read, but error_bad_lines complains with something like "Skipping line xxx: expected 22 fields, saw 23" since each line is different.
I want it to treat each line as a single row in the dataframe. How can this be achieved? Any tips would be appreciated.
If you just want each line to be one row and one column, then don't use read_csv. Just read the file line by line and build the data frame from it.
You could do this manually by creating an empty data frame with a single column header, then iterating over each line in the file and appending it to the data frame.
# explicitly iterate over each line in the file, appending it to the df
import pandas as pd

df = pd.DataFrame([], columns=['line'])
with open("query4.txt") as myfile:
    for line in myfile:
        # DataFrame.append was removed in pandas 2.0; pd.concat appends a row
        df = pd.concat([df, pd.DataFrame({'line': [line]})], ignore_index=True)
print(df)
This handles the file one line at a time, so we never hold the raw file in memory all at once. It probably isn't the most efficient approach, though: each append reassigns (and copies) the dataframe. But it would certainly work.
However, we can do this more cleanly, since a pandas DataFrame accepts an iterable as its data input.
# create a list to feed the data to the dataframe
import pandas as pd

with open("query4.txt") as myfile:
    mydata = [line for line in myfile]
df = pd.DataFrame(mydata, columns=['line'])
print(df)
Here we read all the lines of the file into a list and then pass the list to pandas to create the data frame. The downside is that if the file is very large we essentially hold two copies of it in memory: one in the list and one in the data frame.
Since pandas accepts an iterable for the data, we can use a generator expression to feed each line of the file to the data frame. Now the data frame builds itself by reading the file one line at a time.
# create a generator to feed the data to the dataframe
import pandas as pd

with open("query4.txt") as myfile:
    mydata = (line for line in myfile)
    df = pd.DataFrame(mydata, columns=['line'])
print(df)
In all three cases there is no need to use read_csv, since the data you want to load isn't a CSV. Each solution produces the same data frame output.
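Since the question asks about a gzip file specifically, the same generator approach works with gzip.open in text mode. A minimal sketch, assuming the file is named query4.txt.gz (the name is an assumption):

import gzip
import pandas as pd

# gzip.open in text mode ('rt') yields decoded lines just like open()
with gzip.open("query4.txt.gz", "rt") as myfile:
    mydata = (line for line in myfile)
    df = pd.DataFrame(mydata, columns=['line'])
print(df)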
SOURCE DATA
this is some data
this is other data
data is fun
data is weird
this is the 5th line
DATA FRAME
line
0 this is some data\n
1 this is other data\n
2 data is fun\n
3 data is weird\n
4 this is the 5th line
Related
I want to modify a data frame in Python with pandas. I want to delete the first and the seventh column (Unix Timestamp, Close), and move the column Symbol to the end of the columns. How could I apply these transformations to the CSV file? I want the changes written to the CSV file permanently; would this be possible?
import pandas as pd

url = 'input.csv'
data = pd.read_csv(url, low_memory=False)
First, drop the columns with the drop() method:
data = data.drop(columns=['Unix Timestamp', 'Close'])
Now use the pop() method, which removes the Symbol column and returns it:
symbol = data.pop('Symbol')
Finally, reattach it at the end:
data['Symbol'] = symbol
After that, if you want to change the contents of the CSV file permanently, save this data back with the to_csv() method:
data.to_csv(url)
Note: if you don't want to save the index, pass index=False to the to_csv() method.
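Putting the steps together, a minimal sketch assuming the filename and column names from the question:

import pandas as pd

# read the original file (the name input.csv is taken from the question)
data = pd.read_csv('input.csv', low_memory=False)

# drop the unwanted columns
data = data.drop(columns=['Unix Timestamp', 'Close'])

# move Symbol to the end: pop removes the column and returns it
data['Symbol'] = data.pop('Symbol')

# overwrite the file so the change is permanent; index=False skips the row index
data.to_csv('input.csv', index=False)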
I got a .csv file with 18k lines of data from 11 different measuring devices. I'm trying to write a separate file for each measuring device so I can plot them later and compute averages more easily. However, with this code, which I scrambled together from YouTube tutorials and web sources, the only thing being written to these files is the fieldnames (the names of the columns).
It just stops after inserting the first line of the .csv instead of looking for the right value in each line and inserting it into the new .csv files.
I've tried a for loop with 11 different if/elif conditions in it, which I thought would filter on the device_id column and route each row to the right device file.
import csv

with open('Data.csv', 'r') as Data_puntenOG:
    Data_punten = csv.DictReader(Data_puntenOG)
    for line in Data_punten:
        if line['device_id'] == 'prototype01':
            with open('HS361.csv', 'w') as HS361:
                csv_HS361 = csv.writer(HS361)
                csv_HS361.writerow(line)
        elif line['device_id'] == "prototype02":
            with open('MinID8.csv', 'w') as MinID8:
                csv_MinID8 = csv.writer(MinID8)
                csv_MinID8.writerow(line)
...and then nine more of the same elif branches with different names/conditions, from prototype03 up to prototype12 (with the exception of 09, because that one was not in the .csv file).
The result is 11 files containing only the first line of the .csv
(id,device_id,measurement_type,measurement_value,timestamp)
instead of a large pile of lines with data from the .csv file.
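Two things explain that symptom: opening the output files with 'w' inside the loop truncates them on every matching row, and calling csv.writer's writerow() on a dict writes the dict's keys, i.e. the header fields. A minimal csv-only sketch that fixes both, keeping one DictWriter per device (the output filenames here are derived from device_id rather than the custom names above):

import csv

writers = {}
open_files = []
with open('Data.csv', newline='') as src:
    reader = csv.DictReader(src)
    for row in reader:
        device = row['device_id']
        # open each output file once, so rows accumulate instead of being overwritten
        if device not in writers:
            f = open(f'{device}.csv', 'w', newline='')
            open_files.append(f)
            writer = csv.DictWriter(f, fieldnames=reader.fieldnames)
            writer.writeheader()
            writers[device] = writer
        writers[device].writerow(row)
for f in open_files:
    f.close()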
If you have pandas installed, this will read the file and write out all rows with the same 'device_id' to a separate file, the name of the file being the 'device_id':
import pandas as pd

df = pd.read_csv('Data.csv')
for id in df['device_id'].unique():
    df[df['device_id'] == id].to_csv(f"{id}.csv")
I think the most convenient way is to use pandas' groupby, because it provides both the unique ids and their corresponding sub-dataframes:
import pandas as pd
df = pd.read_csv('Data.csv')
for id, group in df.groupby('device_id'):
    group.to_csv(f'{id}.csv')
I can't read the data from a CSV file into memory because it is too large; pandas.read_csv on the whole file won't work.
I only want to get data out based on some column values, and that subset should fit into memory. Using a pandas dataframe df that could hypothetically contain the full data from the CSV, I would do
df.loc[df['column_name'] == 1]
The CSV file does contain a header, and it is ordered, so I don't really need to use column_name; I could use the position of that column if I have to.
How can I achieve this? I read a bit about pyspark, but I don't know if this is something where it could be useful.
You can read the CSV file chunk by chunk and retain only the rows you want:
import pandas as pd

# read 10,000 rows at a time and keep only the matching rows
# (the original used error_bad_lines=False, which pandas 2.0 replaced
# with on_bad_lines='skip')
iter_csv = pd.read_csv('sample.csv', iterator=True, chunksize=10000, on_bad_lines='skip')
data = pd.concat([chunk.loc[chunk['Column_name'] == 1] for chunk in iter_csv])
I have a txt file with xyz coordinates extracted from a Kinect. The xyz coordinates are separated by commas and there are 12 columns. There are around 1200 rows, since for every movement I make in front of the Kinect, 30 frames are added per second.
Is your question about what you should use to load it?
If so, to load directly into numpy you can use numpy.loadtxt (https://docs.scipy.org/doc/numpy/reference/generated/numpy.loadtxt.html).
If you want a structure that will allow more flexible access and manipulation of the data, you should use pandas.read_table (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_table.html).
After manipulation you can easily convert the pandas structure into numpy.
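For example, a minimal sketch of both routes, assuming the file is called coords.txt (the name is an assumption) and holds 12 comma-separated numeric columns:

import numpy as np
import pandas as pd

# load directly into a numpy array
arr = np.loadtxt('coords.txt', delimiter=',')

# or load into a pandas DataFrame for more flexible access and manipulation
df = pd.read_table('coords.txt', sep=',', header=None)

# after manipulation, convert the pandas structure back to numpy
arr_again = df.to_numpy()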
This is a sample of how you can read each line of your file and process its data.
This code will:
open file
read lines
split each line at spaces
print a few pieces of info from each line
for each element produced by the first split, split it again at ','
Code:
# create an empty list to store results
rows = []

# open the file
with open('filename.txt', 'r') as f:
    # read each line of the file and store it in the rows list
    rows = f.readlines()

# for each element in the list, do something
for row in rows:
    # split the row at each space, so each column becomes an element of data
    data = row.split()
    # print all of data's content
    print(data)
    # print only the fourth element (index 3) in the data list
    print(data[3])
    # split that column's content at ','
    print(data[3].split(','))
Now you can access every item in each column. You just have to play a little with your data and understand how to access it properly.
But you should consider using the tools suggested by Filipe Aleixo in his answer; that way you'll be able to manipulate the data more easily.
I have a large number of files that I want to import. I do this one by one with pandas, but some of them contain only header text and the actual contents are empty. This is on purpose, but I don't know in advance which files are empty. Also, each file has a different number of columns, and the number of columns in each file is unknown. I use the following code:
lines = pandas.read_csv(fname, comment='#', delimiter=',', header=None)
Is there a way for pandas to return an empty data frame if it doesn't find any non-comment lines in a file? Or some other workaround?
Thanks!
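A sketch of one possible workaround: pandas raises EmptyDataError when read_csv finds no parsable lines, so the exception can be caught and replaced with an empty frame (the empty-frame fallback is an assumption, not from the original thread):

import pandas as pd

try:
    lines = pd.read_csv(fname, comment='#', delimiter=',', header=None)
except pd.errors.EmptyDataError:
    # the file contained only comment lines, fall back to an empty frame
    lines = pd.DataFrame()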