Python search through a text file using a Dataframe column - python

I have a very large text file (11 million+ records), ";" delimited, three columns. I have a Pandas dataframe (single column) with the values that I need to search through the text file.
The problem is that I am not able to load the large text file into memory.
I have nested for loops, searching for each column value within each line of the text file, and that is taking a very long time. This is what I have:
import os
import pandas as pd
os.chdir('D:\\AllFiles\\Projects')
mainPath = os.getcwd()
inputFile = 'A.txt'
inputPath = os.path.join(mainPath, inputFile)
input_data = open(inputPath, 'r')
outputFile = 'A_A.csv'
outputPath = os.path.join(mainPath, outputFile)
output_data = open(outputPath,'w')
# input file name and location
actFile = 'SingleColList.txt'
actPath = os.path.join(mainPath, actFile)
# Read the cleaned data in a dataframe
act_df = pd.read_csv(actPath,header=0)
with input_data as f:
    for num, line in enumerate(f, 1):
        for index, row in act_df.iterrows():
            if row['col1'] in line:
                output_data.write(line)
input_data.close()
output_data.close()
print('Done!')
Is there something faster that I can use?

Here are some examples that might help
Set up a simple file called 'A.txt' as
a;b;c
cat;dog;mouse
mouse;moose;mice
apple;pie;good
no;its;not
what;is;this
three;word;sentences
great;good;best
stop;doing;this
or;doing;that
Create a search word list
mylist = ['ca', 'why', 'help', 'cat', 'is', 'three', 'best']
Option 1: read the file and parse it, then search by column. I don't think this is a good option, but it is shown for comparison purposes.
In [33]: df = pd.read_csv('A.txt', sep=';')
...: df[df['a'].str.contains('|'.join(mylist), regex=True)]
Out[33]:
a b c
0 cat dog mouse
5 three word sentences
Option 2: read file, don't parse so it searches entire row
In [34]: df = pd.read_csv('A.txt', header=0)
...: df[df['a;b;c'].str.contains('|'.join(mylist), regex=True)]
Out[34]:
a;b;c
0 cat;dog;mouse
4 what;is;this
5 three;word;sentences
6 great;good;best
7 stop;doing;this
Option 3: use chunksize and apply option 1 or 2 (or whatever you need) during each iteration; a sketch combining this with the search from the question follows the output below.
In [35]: iterator = pd.read_csv('A.txt', header=0, chunksize=2)
...: for i in iterator:
...:     print(i)
...:
a;b;c
0 cat;dog;mouse
1 mouse;moose;mice
a;b;c
2 apple;pie;good
3 no;its;not
a;b;c
4 what;is;this
5 three;word;sentences
a;b;c
6 great;good;best
7 stop;doing;this
a;b;c
8 or;doing;that
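And here is a rough sketch of Option 3 applied back to the original problem: filter each chunk against the search list and append the matches to the output file. The file and column names follow the toy example and the question (so adjust them to your data), and re.escape guards against regex special characters in the search words.
import re
import pandas as pd

# search words come from the single-column file in the question
act_df = pd.read_csv('SingleColList.txt', header=0)
pattern = '|'.join(map(re.escape, act_df['col1'].astype(str)))

first = True
for chunk in pd.read_csv('A.txt', header=0, chunksize=100000):
    # 'a;b;c' is the whole-row column name from Option 2
    matches = chunk[chunk['a;b;c'].str.contains(pattern, regex=True, na=False)]
    matches.to_csv('A_A.csv', mode='w' if first else 'a', header=first, index=False)
    first = False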

Related

Remove blank cells from CSV file with python

I have a text file that I am converting to csv using python. The text file has columns that are set using several spaces. My code strips the line, converts 2 spaces in a row to commas, and then splits the lines again. When I do this, the columns don't line up because there are some columns that have more blank spaces than others. How can I add something to my code that will remove the blank cells in my csv file?
I have tried converting the csv file to a pandas DataFrame, but when I run
import pandas as pd
df = pd.read_csv('old.Csv')
delim_whitespace=True
df.to_csv("New.Csv", index=False)
it returns an error ParserError: Error tokenizing data. C error: Expected 40 fields in line 10, saw 42
The code that is stripping the lines and splitting them is
import csv
txtfile = r"Old.txt"
csvfile = r"Old.Csv"
with open(txtfile, 'r') as infile, open(csvfile, 'w', newline='') as outfile:
    stripped = (line.strip() for line in infile)
    replace = (line.replace("  ", ",") for line in stripped if line)
    lines = (line.split(",") for line in replace if infile)
    writer = csv.writer(outfile)
    writer.writerows(lines)
One solution is to declare the column names beforehand, so as to force pandas to accept rows with a different number of columns. Something like this should work:
df = pd.read_csv('myfilepath', names = ['col1', 'col2', 'col3'])
You will have to adapt separator and column names / number of columns yourself.
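If the columns in Old.txt are separated by runs of spaces, a whitespace separator plus explicit names is one way to line things up. This is only a sketch: it assumes three columns, and the column names are placeholders.
import pandas as pd

# sep=r'\s+' treats any run of whitespace as one delimiter, so uneven spacing
# no longer produces extra empty fields; add header=0 if Old.txt already has a header row
df = pd.read_csv('Old.txt', sep=r'\s+', names=['col1', 'col2', 'col3'])
df.to_csv('New.csv', index=False)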
(Edited) The code below should work for a text file like this:
a b c d e
=============================
1 qwerty 3 4 5 6
2 ewer e r y i
3 asdfghjkutrehg c v b n
You can try:
import pandas as pd
df = pd.read_fwf('textfile.txt', delimiter=' ', header=0, skiprows=[1])
df.to_csv("New.csv", index=False)
print(df)
Unnamed: 0 a b c d e
0 1 qwerty 3 4 5 6
1 2 ewer e r y i
2 3 asdfghjkutrehg c v b n

How do I add a column header in the second row of a pandas DataFrame?

I have a DataFrame from pandas and now I want to add column names, but only for the second row. Here is an example of my previous output:
Desired output:
My code:
data_line=open("file1.txt", mode="r")
lines=[]
for line in data_line:
    lines.append(line)
for i, line in enumerate(lines):
    # print('{}={}'.format(i+1, line.strip()))
    file1_header=lines[0]
num_line=1
Dictionary_File1={}
Value_File1= data_type[0:6]
Value_File1_short=[]
i=1
for element in Value_File1:
    type=element.split(',')
    Value_File1_short.append(type[0] + ", " + type[1] + ", " + type[4])
    i += 1
Dictionary_File1[file1_header]=Value_File1_short
pd_file1=pd.DataFrame.from_dict(Dictionary_File1)
You should have a look at pandas.read_csv. The header keyword parameter allows you to indicate which line in the file to use for the header names.
You could probably do it with something like:
pd.read_csv("file1.txt", header=1)
From my python shell I tested it out with:
>>> from io import StringIO # I use python3
>>> import pandas as pd
>>> data = """Type Type2 Type3
... A B C
... 1 2 3
... red blue green"""
>>> # StringIO below allows us to use "data" as input to read_csv
>>> # "sep" keyword is used to indicate how columns are separated in data
>>> df = pd.read_csv(StringIO(data), header=1, sep='\s+')
>>> df
A B C
0 1 2 3
1 red blue green
You can write a row using the csv module before writing your dataframe to the same file. Notice this won't help when reading back to Pandas, which doesn't work with "duplicate headers". You can create MultiIndex columns, but this isn't necessary for your desired output.
import pandas as pd
import csv
from io import StringIO
# input file
x = """A,B,C
1,2,3
red,blue,green"""
# replace StringIO(x) with 'file.txt'
df = pd.read_csv(StringIO(x))
with open('file.txt', 'w', newline='') as fout:
    writer = csv.writer(fout)
    writer.writerow(['Type', 'Type2', 'Type3'])
    df.to_csv(fout, index=False)
# read file to check output is correct
df = pd.read_csv('file.txt')
print(df)
# Type Type2 Type3
# 0 A B C
# 1 1 2 3
# 2 red blue green
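The MultiIndex route mentioned above would look roughly like this; it is only a sketch reusing the same toy data, with 'Type', 'Type2', 'Type3' stacked above the existing column names as a second level.
import pandas as pd
from io import StringIO

x = """A,B,C
1,2,3
red,blue,green"""

df = pd.read_csv(StringIO(x))
# put the type labels above the existing column names as a second header level
df.columns = pd.MultiIndex.from_arrays([['Type', 'Type2', 'Type3'], df.columns])
print(df)
#   Type Type2  Type3
#      A     B      C
# 0    1     2      3
# 1  red  blue  green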
So if I understand properly, you have a file "file.txt" containing your data, and a list containing the types of your data.
You want to add the list of types to the pandas.DataFrame of your data. Correct?
If so, you can read the data from the txt file into a pandas DataFrame using pandas.read_csv(), and then define the column headers using df.columns.
So it would look something like:
df = pd.read_csv("file1.txt", header=None)
df.columns = data_type[0:6]
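For a self-contained illustration (only a sketch; data_type here is a placeholder list standing in for the variable from the question):
import pandas as pd
from io import StringIO

x = """1,2,3
red,blue,green"""

data_type = ['Type', 'Type2', 'Type3']   # placeholder for the asker's list

df = pd.read_csv(StringIO(x), header=None)
df.columns = data_type
print(df)
#   Type Type2  Type3
# 0    1     2      3
# 1  red  blue  green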
I hope this helps!
Cheers

How to remove a cell of certain value from a CSV file

I've a csv file which has a single cell in the last row, which I need to find and delete. e.g.
Total1254612
The value of the total won't be the same all the time, and that is what is causing the problem.
You can leverage the fact you know there will be only one value and that the first 5 letters will be 'Total'. I would just rewrite all the lines that don't meet these conditions to a new file:
f_original = open(fname, 'r')
f_new = open(fname+'_new.csv', 'w')
# iterate through the lines
for line in f_original:
    if line.startswith('Total'):
        f_new.write(line)
f_original.close()
f_new.close()
This iterates through the main file and writes to a new file without the 'Total' cell:
f_original = open(fname, 'r')
f_new = open(fname+'_new.csv', 'w')
# iterate through the lines
for line in f_original:
    if not line.startswith('Total'):
        f_new.write(line)
f_original.close()
f_new.close()
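The same thing with context managers is a touch safer (just a sketch), since both files are closed even if something goes wrong part-way:
with open(fname, 'r') as f_original, open(fname + '_new.csv', 'w') as f_new:
    for line in f_original:
        if not line.startswith('Total'):
            f_new.write(line)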
Thanks Lucas & Wilbur
You can also use the pandas library:
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({
   ...:     'A': [1, 2, 3, 4],
   ...:     'B': ['a', 'b', 'c', 'd'],
   ...: })
In [3]: df.head()
A B
0 1 a
1 2 b
2 3 c
3 4 d
In [4]: df.drop(df.index[len(df)-1])
A B
0 1 a
1 2 b
2 3 c
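Applied to a file on disk, a minimal sketch could look like this (assuming the file is called 'data.csv'; iloc[:-1] keeps everything except the trailing row):
import pandas as pd

df = pd.read_csv('data.csv')
df.iloc[:-1].to_csv('data_clean.csv', index=False)   # drop the trailing 'Total...' row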

Remove row from CSV that contains empty cell using Python

I am splitting a CSV file based on a column with dates into separate files. However, some rows contain a date while the other cells are empty. I want to remove these rows with empty cells from the CSV, but I'm not sure how to do this.
Here's is my code:
csv.field_size_limit(sys.maxsize)
with open(main_file, "r") as fp:
    root = csv.reader(fp, delimiter='\t', quotechar='"')
    result = collections.defaultdict(list)
    next(root)
    for row in root:
        year = row[0].split("-")[0]
        result[year].append(row)
for i,j in result.items():
    row_count = sum(1 for row in j)
    print(row_count)
    file_path = "%s%s-%s.csv"%(src_path, i, row_count)
    with open(file_path, 'w') as fp:
        writer = csv.writer(fp, delimiter='\t', quotechar='"')
        writer.writerows(j)
Pandas is perfect for this, especially if you want this to be easily adjusted to, say, other file formats. Of course, one could consider it overkill.
To just remove rows with empty cells:
>>> import pandas as pd
>>> data = pd.read_csv('example.csv', sep='\t')
>>> print data
A B C
0 1 2 5
1 NaN 1 9
2 3 4 4
>>> data.dropna()
A B C
0 1 2 5
2 3 4 4
>>> data.dropna().to_csv('example_clean.csv')
I leave performing the splitting and saving into separate files using pandas as an exercise to start learning this great package if you want :)
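A rough sketch of that split-and-save step, assuming the date column is called 'date' and the file is tab-separated (both are assumptions, so adjust them to your data):
import pandas as pd

data = pd.read_csv('example.csv', sep='\t')
data = data.dropna()                                   # drop rows with empty cells
data['year'] = data['date'].astype(str).str.split('-').str[0]
for year, group in data.groupby('year'):
    # one file per year, named like the original code: <year>-<row count>.csv
    group.drop(columns='year').to_csv('{}-{}.csv'.format(year, len(group)), sep='\t', index=False)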
This would skip all rows with at least one empty cell:
with open(main_file, "r") as fp:
    ....
    for row in root:
        if not all(map(len, row)):
            continue
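Integrated into the splitting loop from the question, that check would look roughly like this (a sketch; the path is a placeholder and the column layout follows the question's code):
import csv
import sys
import collections

main_file = "data.csv"                        # placeholder path
csv.field_size_limit(sys.maxsize)
result = collections.defaultdict(list)
with open(main_file, "r") as fp:
    root = csv.reader(fp, delimiter='\t', quotechar='"')
    next(root)                                # skip the header
    for row in root:
        if not all(map(len, row)):            # any empty cell -> skip the row
            continue
        result[row[0].split("-")[0]].append(row)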
Pandas is best in Python for handling any type of data processing. For help, you can go through this link: http://pandas.pydata.org/pandas-docs/stable/10min.html

Skipping an unknown number of lines to read the header (python pandas)

I have Excel data that I read in with Python pandas:
import pandas as pd
data = pd.read_csv('..../file.txt', sep='\t' )
the mock data looks like this:
unwantedjunkline1
unwantedjunkline2
unwantedjunkline3
ID ColumnA ColumnB ColumnC
1 A B C
2 A B C
3 A B C
...
The data in this case contains 3 junk lines (lines I don't want to read in) before hitting the header, and sometimes it contains 4 or more such junk lines. So in this case I read in the data:
data = pd.read_csv('..../file.txt', sep='\t', skiprows = 3 )
data looks like:
ID ColumnA ColumnB ColumnC
1 A B C
2 A B C
3 A B C
...
But each time the number of unwanted lines is different. Is there a way to read in a table file using pandas without using 'skiprows=', but instead using some command that matches the header, so it knows to start reading from the header? That way I don't have to open the file to count how many unwanted lines it contains each time and then manually change the 'skiprows=' option.
If you know what the header startswith:
import os
import pandas as pd

def skip_to(fle, line, **kwargs):
    if os.stat(fle).st_size == 0:
        raise ValueError("File is empty")
    with open(fle) as f:
        pos = 0
        cur_line = f.readline()
        while not cur_line.startswith(line):
            pos = f.tell()
            cur_line = f.readline()
        f.seek(pos)
        return pd.read_csv(f, **kwargs)
Demo:
In [18]: cat test.txt
1,2
3,4
The,header
foo,bar
foobar,foo
In [19]: df = skip_to("test.txt","The,header", sep=",")
In [20]: df
Out[20]:
The header
0 foo bar
1 foobar foo
By calling .tell we keep track of where the pointer is for the previous line so when we hit the header we seek back to that line and just pass the file object to pandas.
Or using the junk if they all started with something in common:
def skip_to(fle, junk, **kwargs):
    if os.stat(fle).st_size == 0:
        raise ValueError("File is empty")
    with open(fle) as f:
        pos = 0
        cur_line = f.readline()
        while cur_line.startswith(junk):
            pos = f.tell()
            cur_line = f.readline()
        f.seek(pos)
        return pd.read_csv(f, **kwargs)

df = skip_to("test.txt", "junk", sep="\t")
Another simple way to achieve a dynamic skiprows is something like this, which worked for me:
# Open the file
with open('test.txt', encoding='utf-8') as readfile:
    ls_readfile = readfile.readlines()

# Find the skiprows number, using ID as the startswith marker
skip = next(filter(lambda x: x[1].startswith('ID'), enumerate(ls_readfile)))[0]
print(skip)

# Import the file with the separator \t
df = pd.read_csv('test.txt', skiprows=skip, sep='\t')
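A memory-friendlier variant of the same idea (just a sketch, with the same assumptions: the header line starts with 'ID' and the file is tab-separated) stops scanning as soon as the header is found instead of reading every line into a list:
import pandas as pd

with open('test.txt', encoding='utf-8') as readfile:
    # stop at the first line that starts with 'ID'; its 0-based index is the skip count
    skip = next(i for i, line in enumerate(readfile) if line.startswith('ID'))

df = pd.read_csv('test.txt', skiprows=skip, sep='\t')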
