Read text file python Numpy

I have a txt file that has xyz coordinates extracted from a Kinect. The xyz coordinates are separated by commas and there are 12 columns. There are around 1200 rows, since for every movement I make in front of the Kinect, 30 frames are added per second.

Is your question about what you should use to load it?
If so, to load it directly into numpy you can use numpy.loadtxt (https://docs.scipy.org/doc/numpy/reference/generated/numpy.loadtxt.html).
If you want a structure that allows more flexible access and manipulation of the data, you should use pandas.read_table (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_table.html).
After manipulation you can easily convert the pandas structure into numpy.
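For example, a minimal sketch of both options (the file name kinect.txt is an assumption; adjust it and the delimiter to your file):
import numpy as np
import pandas as pd

# Load directly into a numpy array (12 comma-separated columns).
arr = np.loadtxt('kinect.txt', delimiter=',')

# Or load into a pandas DataFrame for more flexible access and manipulation.
df = pd.read_table('kinect.txt', sep=',', header=None)

# After manipulation, convert the pandas structure back to a plain numpy array.
arr_from_df = df.to_numpy()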

This is a sample of how you can read each line of your file and process its data.
This code will:
open file
read lines
split each line at spaces
print some information from each line
for each element that came out of the first split, split it again at ','
Code:
#create an empty list to store results
rows = []
#open the file
with open('filename.txt', 'r') as f:
    #read each line of the file and store it in the rows list
    rows = f.readlines()
#for each element in the list, do something
for row in rows:
    #split the row at each space, so each column becomes a list element, and assign the result to data
    data = row.split()
    #print all of data's content
    print(data)
    #print only the element at index 3 of the data list
    print(data[3])
    #split that column's content at ','
    print(data[3].split(','))
Now you can access every item in each column. You just have to play a little with your data and understand how to properly access it.
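For instance, assuming the comma-separated values are numeric (as xyz coordinates usually are), you could convert one column to floats like this:
# Hypothetical follow-up: turn the comma-separated values of one column into floats
values = data[3].split(',')
numbers = [float(v) for v in values]
print(numbers)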
But you should consider using the tools provided by Filipe Aleixo in his answer; that way you'll be able to manipulate the data more easily.

Related

Comparing 2 csv files to remove rows

I have 2 csv files that have information related to each other. Each row of one csv file corresponds to a row in the other file. In order to prepare the data, I needed to remove certain values from the first csv file, which resulted in removing certain rows from that file. Now when I print those rows out, the row numbers jump around. For example, a certain portion of the first csv file jumps from row number 20838 to 20842, 20843, etc. So what I want to do is compare the first csv file (which had certain rows removed) to the second csv file, remove from the second csv file the rows that are no longer in the first csv file, and then renumber all the rows so that both csv files have rows listed from 0 to 20000. I am using Pandas and numpy.
This is the code I have used to remove the information from the first csv file:
data_csv1 = pd.read_csv("address1")
data_csv2 = pd.read_csv("address2")
data_csv1 = data_csv1.drop(data_csv1.columns[[0]], axis=1)
data_csv1 = data_csv1[(data_csv1 !=0).all(1)]
How would I go about doing this? I personally do not care if the data is removed or simply ignored, I just need both csv files to contain the same row numbers.
Assuming that your two files had exactly the same index at the start, you can pass the index of the first file to the second file after post-processing:
data_csv2 = data_csv2.iloc[data_csv1.index]
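If you also want both frames renumbered from 0 upwards afterwards (as described in the question), a small sketch using the same variable names:
# Renumber both frames from 0 upwards so their row labels line up again.
data_csv1 = data_csv1.reset_index(drop=True)
data_csv2 = data_csv2.reset_index(drop=True)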

Python Pandas read_csv to dataframe without separator

I'm new to the Pandas library.
I have shared code that works off of a dataframe.
Is there a way to read a gzip file line by line without any delimiter (using the full line; the line can include commas and other characters) as a single row and use it in the dataframe? It seems that you have to provide a delimiter, and when I provide "\n" it is able to read, but error_bad_lines complains with something like "Skipping line xxx: expected 22 fields but got 23", since each line is different.
I want it to treat each line as a single row in the dataframe. How can this be achieved? Any tips would be appreciated.
If you just want each line to be one row and one column, then don't use read_csv. Just read the file line by line and build the data frame from it.
You could do this manually by creating an empty data frame with a single column header, then iterating over each line in the file and appending it to the data frame.
#explicitly iterate over each line in the file, appending it to the df.
import pandas as pd
with open("query4.txt") as myfile:
    df = pd.DataFrame([], columns=['line'])
    for line in myfile:
        df = df.append({'line': line}, ignore_index=True)
print(df)
This will work for large files, as we only process one line at a time while building the dataframe, so we don't use more memory than needed. It probably isn't the most efficient approach, since there is a lot of reassigning of the dataframe here, but it would certainly work.
However, we can do this more cleanly, since a pandas DataFrame can take an iterable as the input for its data.
#create a list to feed the data to the dataframe.
import pandas as pd
with open("query4.txt") as myfile:
    mydata = [line for line in myfile]
    df = pd.DataFrame(mydata, columns=['line'])
print(df)
Here we read all the lines of the file into a list and then pass the list to pandas to create the data frame from. The downside to this is that if our file is very large, we essentially have two copies of it in memory: one in the list and one in the data frame.
Since we know pandas will accept an iterable for the data, we can use a generator expression to feed each line of the file to the data frame. Now the data frame builds itself by reading one line at a time from the file.
#create a generator to feed the data to the dataframe.
import pandas as pd
with open("query4.txt") as myfile:
    mydata = (line for line in myfile)
    df = pd.DataFrame(mydata, columns=['line'])
print(df)
In all three cases there is no need to use read_csv, since the data you want to load isn't a CSV. Each solution produces the same data frame output:
SOURCE DATA
this is some data
this is other data
data is fun
data is weird
this is the 5th line
DATA FRAME
line
0 this is some data\n
1 this is other data\n
2 data is fun\n
3 data is weird\n
4 this is the 5th line
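Since the question mentions a gzip file, the same generator approach should also work by swapping open for gzip.open in text mode; a sketch (the file name query4.txt.gz is an assumption):
#read a gzipped file line by line into the dataframe.
import gzip
import pandas as pd
with gzip.open("query4.txt.gz", "rt") as myfile:
    mydata = (line for line in myfile)
    df = pd.DataFrame(mydata, columns=['line'])
print(df)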

Python: How to create a new dataframe starting with the first row where a specific value is found

I am reading csv files into python using:
df = pd.read_csv(r"C:\csvfile.csv")
But the file has some summary data, and the raw data starts where a value "valx" is found. If "valx" is not found, then the file is useless. I would like to create new dataframes that start where "valx" is found. I have been trying for a while with no success. Any help on how to achieve this is greatly appreciated.
Unfortunately, pandas' skiprows only helps if you already know how many rows to skip at the beginning. You might want to parse the file before creating the dataframe.
As an example:
import csv
with open(r"C:\csvfile.csv", "r", newline='') as f:
    lines = list(csv.reader(f))
if any('valx' in row for row in lines):
    data = lines
Using the standard library csv module, you can read the file into a list of rows and check whether "valx" appears anywhere; if it is found, the content is kept in the data variable.
From there you can use the data variable to create your dataframe.
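For example, a minimal sketch of that second step, reusing the lines list from above (it assumes the raw data starts at the row containing "valx" and everything before it is skipped):
import pandas as pd

# Find the first row containing "valx" and build a dataframe from that row on;
# if "valx" never appears, no dataframe is built (the file is useless).
start = next((i for i, row in enumerate(lines) if 'valx' in row), None)
if start is not None:
    df = pd.DataFrame(lines[start:])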

python: compare two large csv files by two reference columns and update another column

I have a quite large csv file, about 400000 lines, like:
54.10,14.20,34.11
52.10,22.20,22.11
49.20,17.30,29.11
48.40,22.50,58.11
51.30,19.40,13.11
and a second one, about 250000 lines, with updated data for the third column - the first and second columns are the reference for the update:
52.10,22.20,22.15
49.20,17.30,29.15
48.40,22.50,58.15
I would like to build a third file like:
54.10,14.20,34.11
52.10,22.20,22.15
49.20,17.30,29.15
48.40,22.50,58.15
51.30,19.40,13.11
It has to contain all the data from the first file, except that on the matching lines the value of the third column is taken from the second file.
I suggest you look at the Pandas merge functions. You should be able to do what you want, and it will also handle reading the data from CSV (create a dataframe for each file and merge them).
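A sketch of that merge idea (the column names x, y, val and the file names are assumptions for illustration):
import pandas as pd

# Read both files as strings so the key columns compare exactly and keep their formatting.
df1 = pd.read_csv('file1.csv', header=None, names=['x', 'y', 'val'], dtype=str)
df2 = pd.read_csv('file2.csv', header=None, names=['x', 'y', 'val'], dtype=str)

# Left-merge on the two reference columns and take the updated third column
# from the second file where a match exists.
merged = df1.merge(df2, on=['x', 'y'], how='left', suffixes=('', '_new'))
merged['val'] = merged['val_new'].fillna(merged['val'])
merged[['x', 'y', 'val']].to_csv('output.csv', index=False, header=False)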
A stdlib solution with just the csv module; the second file is read into memory (into a dictionary):
import csv
with open('file2.csv', newline='') as updates_fh:
    updates = {tuple(r[:2]): r for r in csv.reader(updates_fh)}
with open('file1.csv', newline='') as infh, open('output.csv', 'w', newline='') as outfh:
    writer = csv.writer(outfh)
    writer.writerows(updates.get(tuple(r[:2]), r) for r in csv.reader(infh))
The first with statement opens the second file and builds a dictionary keyed on the first two columns. It is assumed that these are unique in the file.
The second block then opens the first file for reading and the output file for writing, and writes each row from the input file to the output file, replacing any row present in the updates dictionary with the updated version.

handling a huge file with python and pytables

simple problem, but maybe tricky answer:
The problem is how to handle a huge .txt file with pytables.
I have a big .txt file, with MILLIONS of lines, short lines, for example:
line 1 23458739
line 2 47395736
...........
...........
The content of this .txt file must be saved into a pytable; OK, that's easy. There is nothing else to do with the info in the txt file, just copy it into pytables, and now we have a pytable with, for example, 10 columns and millions of rows.
The problem comes up when, from the content of the txt file, 10 columns x millions of lines are directly generated in the pytable BUT, depending on the data on each line of the .txt file, new columns must be created in the pytable. So how can this be handled efficiently?
Solution 1: first copy the whole text file, line by line, into the pytable (millions of lines), and then iterate over each row of the pytable (millions again) and, depending on the values, generate the new columns needed.
Solution 2: read the .txt file line by line, do whatever is needed, calculate the new values, and then send all the info to a pytable.
Solution 3: ... any other efficient and faster solution?
I think the basic problem here is one of conceptual model. PyTables' Tables only handle regular (or structured) data. However, the data that you have is irregular or unstructured, in that the structure is determined as you read the data. Said another way, PyTables needs the column description to be known completely by the time that create_table() is called. There is no way around this.
Since, in your problem statement, any line may add a new column, you have no choice but to do this in two full passes through the data: (1) read through the data and determine the columns, and (2) write the data to the table. In pseudocode:
import tables as tb
cols = {}
# discover columns
d = open('data.txt')
for line in d:
    for col in line:
        if col not in cols:
            cols['colname'] = col
# write table
d.seek(0)
f = tb.open_file(...)
t = f.create_table(..., description=cols)
for line in d:
    row = line_to_row(line)
    t.append(row)
d.close()
f.close()
Obviously, if you knew the table structure ahead of time you could skip the first loop and this would be much faster.
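For comparison, a sketch of that single-pass case with a fixed, known description (the two-column schema below is made up for illustration, based on the sample lines in the question):
import tables as tb

# Hypothetical fixed schema: an integer line id and an integer value per line.
class Record(tb.IsDescription):
    line_id = tb.Int64Col()
    value = tb.Int64Col()

with tb.open_file('data.h5', mode='w') as f:
    t = f.create_table('/', 'data', Record)
    row = t.row
    with open('data.txt') as d:
        for line in d:
            # assumes each line looks like: "line 1 23458739"
            _, line_id, value = line.split()
            row['line_id'] = int(line_id)
            row['value'] = int(value)
            row.append()
    t.flush()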
