I want to use RxPy to open a (csv) file and process the file line by line. My precisely I envision to have the following steps
provide a filename to the stream
open the file
read file line by line
remove lines which start with a comment (e.g. # ...)
apply csv reader
filter records matching some criteria
So far I have:
def to_file(filename):
f = open(filename)
return Observable.using(
lambda: AnonymousDisposable(lambda: f.close()),
lambda d: Observable.just(f)
)
def to_reader(f):
return csv.reader(f)
def print_rows(reader):
for row in reader:
print(row)
This works
Observable.from_(["filename.csv", "filename2.csv"])
.flat_map(to_file).**map**(to_reader).subscribe(print_rows)
This doesn't: ValueError: I/O operation on closed file
Observable.from_(["filename.csv", "filename2.csv"])
.flat_map(to_file).**flat_map**(to_rows).subscribe(print)
The 2nd doesn't work because (see https://github.com/ReactiveX/RxPY/issues/69)
When the observables from the first flatmap is merged by the second flatmap, the inner subscriptions will be disposed when they complete. Thus the files will be closed, even if the file handles are on_next'ed into the new observable set up by the second flatmap.
Any idea how I can achieve:
Something like:
Observable.from_(["filename.csv", "filename2.csv"]
).flat_map(to_file
).filter(comment_lines
).filter(empty_lines
).map(to_csv_reader
).filter(filter_by.. )
).do whatever
Thanks a lot for your help
Juergen
I just started working with RxPy recently and needed to do the same thing. Surprised someone hasn't already answered your question but decided to answer just in case someone else needs to know. Assuming you have a CSV file like this:
$ cat datafile.csv
"iata","airport","city","state","country","lat","long"
"00M","Thigpen ","Bay Springs","MS","USA",31.95376472,-89.23450472
"00R","Livingston Municipal","Livingston","TX","USA",30.68586111,-95.01792778
"00V","Meadow Lake","Colorado Springs","CO","USA",38.94574889,-104.5698933
"01G","Perry-Warsaw","Perry","NY","USA",42.74134667,-78.05208056
"01J","Hilliard Airpark","Hilliard","FL","USA",30.6880125,-81.90594389
Here is a solution:
from rx import Observable
from csv import DictReader
Observable.from_(DictReader(open('datafile.csv', 'r'))) \
.subscribe(lambda row:
print("{0:3}\t{1:<35}".format(row['iata'], row['airport'][:35]))
)
Related
first time poster - long time lurker.
I'm a little rough around the edges when it comes to Python and I'm running into an issue that I imagine has an easy fix.
I have a CSV file that I'm looking to read and perform a semi-advanced lookup.
The CSV in question is not really comma delimited when inspecting it in VS Code except the last "column".
Example (direct format screenshot from the file):
screenshot
The line that seems to have issues is:
import csv
import sys
from util import Node, StackFrontier, QueueFrontier
names = {}
people = {}
titles = {}
def load_data(directory):
with open(f"{directory}/file.csv", encoding="utf-8") as f:
reader = csv.DictReader(f)
for row in reader:
people[row["id"]] = {
"primaryName": row["primaryName"],
"birthYear": row["birthYear"],
"titles": set()
}
if row["primaryName"].lower() not in names:
names[row["primaryName"].lower()] = {row["nconst"]}
else:
names[row["primaryName"].lower()].add(row["nconst"])
The error I receive is:
File "C:\GitHub\Project\data-test.py", line 24, in load_data
"primaryName": row["primaryName"],
~~~^^^^^^^^^^^^^^^
KeyError: 'primaryName'
I've tried this with other CSV files where they are comma delimited, (screenshot example below):
screenshot
And that works perfectly fine. I noticed the above CSV file has the names in ""s which I imagine could be part of the solution.
Ultimately, if I can get it to work with the code above that would be great - otherwise, is there an easy way to automatically format the CSV file to put quotations around the names and separate the value by commas like the csv that's working above?
Thanks in advance for any help.
I am fairly new to programming, so bear with me!
We have a task at school which we are made to clean up three text files ("Balance1", "Saving", and "Withdrawal") and append them together into a new file. These files are just names and sums of money listed downwards, but some of it is jumbled. This is my code for the first file Balance1:
with open('Balance1.txt', 'r+') as f:
f_contents = f.readlines()
# Then I start cleaning up the lines. Here I edit Anna's savings to an integer.
f_contents[8] = "Anna, 600000"
# Here I delete the blank lines and edit in the 50000 to Philip.
del f_contents[3]
del f_contents[3]
In the original text file Anna's savings is written like this: "Anna, six hundred thousand" and we have to make it look clean, so its rather "NAME, SUM (as integer). When I print this as a list it looks good, but after I have done this with all three files I try to append them together in a file called "Balance.txt" like this:
filenames = ["Balance1.txt", "Saving.txt", "Withdrawal.txt"]
with open("Balance.txt", "a") as outfile:
for filename in filenames:
with open(filename) as infile:
contents = infile.read()
outfile.write(contents)
When I check the new text file "Balance" it has appended them together, but just as they were in the beginning and not with my edits. So it is not "cleaned up". Can anyone help me understand why this happens, and what I have to do so it appends the edited and clean versions?
In the first part, where you do the "editing" of Balance.txt` file, this is what happens:
You open the file in read mode
You load the data into memory
You edit the in memory data
And voila.
You never persisted the changes to any file on the disk. So when in the second part you read the content of all the files, you will read the data that was originally there.
So if you want to concatenate the edited data, you have 2 choices:
Pre-process the data by creating 3 final correct files (editing Balance1.txt and persisting it to another file, say Balance1_fixed.txt) and then in the second part, concatenate: ["Balance1_fixed.txt", "Saving.txt", "Withdrawal.txt"]. Total of 4 data file openings, more IO.
Use only the second loop you have, and correct the contents before writing it to the outfile. You can use readlines() like you did first, edit the specific line and then use writelines(). Total of 3 data file openings, less IO than previous option
I have a file with user's names, one per line, and I need to compare each name in the file to all values in a csv file and make note each time the user name appears in the csv file. I need to make the search as efficient as possible as the csv file is 40K lines long
My example persons.txt file:
Smith, Robert
Samson, David
Martin, Patricia
Simpson, Marge
My example locations.csv file:
GreaterLocation,LesserLocation,GroupName,DisplayName,InBook
NorthernHemisphere,UnitedStates,Pilots,"Wilbur, Andy, super pilot",Yes
WesternHemisphere,China,Pilots,"Kirby, Mabry, loves pizza",Yes
WesternHemisphere,Japan,Drivers,"Samson, David, big kahuna",Yes
NortherHemisphere,Canada,Drivers,"Randos, Jorge",Yes
SouthernHemispher,Australia,Mechanics,"Freeman, Gordon",Yes
NortherHemisphere,Mexico,Pilots,"Simpson, Marge",Yes
SouthernHemispher,New Zealand,Mechanics,"Samson, David",Yes
My Code:
import csv
def parse_files():
with open('data_file/persons.txt', 'r') as user_list:
lines = user_list.readlines()
for user_row in lines:
new_user = user_row.strip()
per = []
with open('data_file/locations.csv', newline='') as target_csv:
DictReader_loc = csv.DictReader(target_csv)
for loc_row in DictReader_loc:
if new_user.lower() in loc_row['DisplayName'].lower():
per.append(DictReader_loc.line_num)
print(DictReader_loc.line_num, loc_row['DisplayName'])
if len(per) > 0:
print("\n"+new_user, per)
print("Parse Complete")
def main():
parse_files()
main()
My code currently works. Based on the sample data in the example files, the code matches the 2 instances of "Samson, David" and 1 instance of "Simpson, Marge" in the locations.csv file. I'm hoping that someone can give me guidance on how I might transform either the persons.txt file or the locations.csv file (40K+ lines) so that the process is as efficient as it can be. I think it currently takes 10-15 minutes. I know looping isn't the most efficient, but I do need to check each name and see where it appears in the csv file.
I think #Tomalak's solution with SQLite is very useful, but if you want to keep it closer to your original code, see the version below.
Effectively, it reduces the amount of file opening/closing/reading that is going on, and hopefully will speed things up.
Since your sample is very small, I could not do any real measurements.
Going forward, you can consider using pandas for these kind of tasks - it can be very convenient working with CSVs and more optimized than the csv module.
import csv
def parse_files():
with open('persons.txt', 'r') as user_list:
# sets are faster to match against than lists
# do the lower() here to avoid repetition
user_set = set([u.strip().lower() for u in user_list.readlines()])
# open file at beginning, close after done
# you could also encapsulate the whole thing into a `with` clause if
# desired
target_csv = open("locations.csv", "r", newline='')
DictReader_loc = csv.DictReader(target_csv)
for user in user_set:
per = []
for loc_row in DictReader_loc:
if user in loc_row['DisplayName'].lower():
per.append(DictReader_loc.line_num)
print(DictReader_loc.line_num, loc_row['DisplayName'])
if len(per) > 0:
print("\n"+user, per)
print("Parse Complete")
target_csv.close()
def main():
parse_files()
main()
I apologize if this is a very beginner-ish question. But I have a multivariate data set from reddit ( https://files.pushshift.io/reddit/submissions/), but the files are way too big. Is it possible to downsample one of these files down to 20% or less, and either save it as a new file (json or csv) or directly read it as a pandas dataframe? Any help will be very appreciated!
Here is my attempt thus far
def load_json_df(filename, num_bytes = -1):
'''Load the first `num_bytes` of the filename as a json blob, convert each line into a row in a Pandas data frame.'''
fs = open(filename, encoding='utf-8')
df = pd.DataFrame([json.loads(x) for x in fs.readlines(num_bytes)])
fs.close()
return df
january_df = load_json_df('RS_2019-01.json')
january_df.sample(frac=0.2)
However this gave me a memory error while trying to open it. Is there a way to downsample it without having to open the entire file?
The problem is, it is not possible to determine exactly what the 20% of the data is. In order to do that you must first read the entire length of the file and only then you can get an idea of what a 20% would look like.
Reading a large file into memory all at once throws this error generally. You can process this by reading the file line-by-line with below code:
data = []
counter = 0
with open('file') as f:
for line in f:
data.append(json.loads(line))
counter +=1
You should then be able to do this
df = pd.DataFrame([x for x in data]) #you can set a range here with counter/5 if you want to get 20%
I downloaded first of the files, i.e. https://files.pushshift.io/reddit/submissions/RS_2011-01.bz2
decompressed it and looked at the contents. As it happens, it is not a proper JSON but rather JSON-lines - a series of JSON objects, one per line (see http://jsonlines.org/ ). This means you can just cut out as many lines as you want, using any tool you want (for example, a text editor). Or you can just process the file sequentially in your Python script, taking into account every fifth line, like this:
with open('RS_2019-01.json', 'r') as infile:
for i, line in enumerate(infile):
if i % 5 == 0:
j = json.loads(line)
# process the data here
sc.textFile(path) allows to read an HDFS file but it does not accept parameters (like skip a number of rows, has_headers,...).
in the "Learning Spark" O'Reilly e-book, it's suggested to use the following function to read a CSV (Example 5-12. Python load CSV example)
import csv
import StringIO
def loadRecord(line):
"""Parse a CSV line"""
input = StringIO.StringIO(line)
reader = csv.DictReader(input, fieldnames=["name", "favouriteAnimal"])
return reader.next()
input = sc.textFile(inputFile).map(loadRecord)
My question is about how to be selective about the rows "taken":
How to avoid loading the first row (the headers)
How to remove a specific row (for instance, row 5)
I see some decent solutions here: select range of elements but I'd like to see if there is anything more simple.
Thx!
Don't worry about loading the rows/lines you don't need. When you do:
input = sc.textFile(inputFile)
you are not loading the file. You are just getting an object that will allow you to operate on the file. So to be efficient, it is better to think in terms of getting only what you want. For example:
header = input.take(1)[0]
rows = input.filter(lambda line: line != header)
Note that here I am not using an index to refer to the line I want to drop but rather its value. This has the side effect that other lines with this value will also be ignored but is more in the spirit of Spark as Spark will distribute your text file in different parts across the nodes and the concept of line numbers gets lost in each partition. This is also the reason why this is not easy to do in Spark(Hadoop) as each partition should be considered independent and a global line number would break this assumption.
If you really need to work with line numbers I recommend that you add them to the file outside of Spark(see here) and then just filter by this column inside of Spark.
Edit: Added zipWithIndex solution as suggested by #Daniel Darabos.
sc.textFile('test.txt')\
.zipWithIndex()\ # [(u'First', 0), (u'Second', 1), ...
.filter(lambda x: x[1]!=5)\ # select columns
.map(lambda x: x[0])\ # [u'First', u'Second'
.collect()