I'm new to Python (pandas, NumPy, etc.).
I'd like to know the best, most performant approach to solve this task.
I have a huge file in the following format, except that everything is on one line:
{"order_reference":"0658-2147","billing_address_zip_code":"8800"}
{"order_reference":"0453-2200","billing_address_zip_code":"8400"}
{"order_reference":"0554-3027","billing_address_zip_code":"8820"}
{"order_reference":"0382-3108","billing_address_zip_code":"3125"}
{"order_reference":"0534-4059","billing_address_zip_code":"3775"}
{"order_reference":"0118-1566","billing_address_zip_code":"3072"}
{"order_reference":"0384-6897","billing_address_zip_code":"8630"}
{"order_reference":"0361-5226","billing_address_zip_code":"4716"}
{"order_reference":"0313-6812","billing_address_zip_code":"9532"}
{"order_reference":"0344-6262","billing_address_zip_code":"3600"}
What is the easiest way to read this file into a dictionary in Python or a DataFrame in pandas? The goal is to join on the billing_address_zip_code with a big JSON file to get more insights into each order_reference.
I was thinking of solving it with regular expressions, but since the file is huge and needs to be joined to another file, I think I should use pandas, shouldn't I?
Or, since all records have the same length, I could also split by length.
Is there a pandas function for that? I guess this would be the fastest way, but since the file isn't standard JSON, I don't know how to do it.
I'm sorry for the beginner questions, but I searched quite a bit on the internet and couldn't find the right answer, and it would really help me to figure out the right approach to this kind of task.
For any help or links, I'm very thankful.
Simon
PS: Which cloud environment do you use for this kind of task? Which works best with Python and the data science libraries?
UPDATE
I used the following code to turn it into valid JSON and loaded it with json.loads() successfully:
# syntax: Python 3
import json
#small test file
my_list = "["+open("orders_play_around.json").read().replace("}{","},\n{")+"]"
d = json.loads(my_list)
So far so good. Now for the next challenge: how do I join this JSON dictionary with another JSON file on the billing_address_zip_code?
The other JSON looks like this:
{
"data": [
{
"BFS-Nr": 1,
"Raum mit städtischem Charakter 2012": 4,
"Typologie der MS-Regionen 2000 (2)": 3,
"E": 679435,
"Zusatzziffer": 0,
"Agglomerationsgrössenklasse 2012": 1,
"Gemeinde-typen (9 Typen) 2000 (1)": 4,
"N": 235653,
"Stadt/Land-Typologie 2012": 3,
"Städte 2012": 0,
"Gemeinde-Grössenklasse 2015": 7,
"BFS Nr.": 1,
"Sprachgebiete 2016": 1,
"Europäsiche Berggebietsregionen (2)": 1,
"Gemeindename_1": "Aeugst am Albis",
"Anwendungsgebiete für Steuerer-leichterungen 2016": 0,
"Kantonskürzel": "ZH",
"Kanton": 1,
"Metropolräume 2000 (2)": 1,
"PLZ": 8914,
"Bezirk": 101,
"Gemeindetypologie 2012\n(25 Typen)": 237,
"Raumplanungs-regionen": 105,
"Gemeindetypologie 2012\n(9 Typen)": 23,
"Agglomerationen und Kerne ausserhalb Agglomerationen 2012": 261,
"Ortschaftsname": "Aeugst am Albis",
"Arbeitsmarktregionen 2000 (2)": 10,
"Gemeinde-typen\n(22 Typen) 2000 (1)": 11,
"Städtische / Ländliche Gebiete 2000 (1)": 2,
"Grossregionen": 4,
"Gemeindename": "Aeugst am Albis",
"MS-Regionen (2)": 4,
"Touris-mus Regionen 2017": 3,
"DEGURBA 2011 eurostat": 3
},
{....}
]
}
What is the easiest way to join them on the key PLZ from plz.js and billing_address_zip_code from orders_play_around.json?
I could load plz.js as JSON without any problems:
plz_data=open('plz.js').read()
plz = json.loads(plz_data)
Sorry for the long message. But hopefully, someone can help me with this easy problem. The goal would be to plot it on a map or on a graph, where I can see which PLZ (zipcode) has the most orders.
Since you mention turning your file to proper JSON is your initial goal, and you don't mind sed, try:
sed 's|}{|}\n{|g' originalfile > result
Note that I added newlines, not commas, which is probably better for your future editing. You can use the -i flag so sed edits in place, but this is safer. If you really want to use Python, it's not a big deal with the standard library. The safest approach is to read character by character:
with open("originalfile") as fd:
while True:
ch=fd.read(1)
if not ch: break
if ch =="{": print("\n")
print(ch,end="")
or just replace and print (I've never tested the limits of Python, but I'm guessing this will work):
print(open("originalfile").read().replace("}{","}\n{"))
No need for regex for this; it's a bit of overkill. Once this is a proper JSON (or newline-delimited JSON) file, it will be easier to use, including loading it through pandas.read_json.
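For example, once each object sits on its own line (the sed output above), pandas can read the file directly as JSON Lines. A minimal sketch, assuming the repaired file is saved as result:

import pandas as pd

# read newline-delimited JSON (one object per line) into a DataFrame;
# "result" is the file produced by the sed command above
orders = pd.read_json("result", lines=True)
print(orders.head())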
Here's one way.
data = []
with open("originalfile") as fp:
    for l in fp:
        clean_line = [x.replace("{", "").replace("}\n", "").replace("\"", "") for x in l.split(",")]
        data.append(clean_line)
Then you can convert the data list into a pandas dataframe and export to JSON.
import pandas

df = pandas.DataFrame(data)
df.to_json()
If you want to remove the text, e.g. "billing_address_zip_code", and keep only data, then you can do
data = []
with open(filepath) as fp:
    for l in fp:
        splitted = [x.split(":")[1] for x in l.split(",")]
        data.append([x.replace("}\n", "").replace("\"", "") for x in splitted])
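From there, a hedged sketch of the join the question asks about, assuming the orders and the plz.js data have both been parsed with json.loads() as shown in the question (the variable names d and plz and the use of pandas.merge are assumptions; the key column names are taken from the question):

import pandas as pd

# build DataFrames from the parsed JSON (d and plz come from the question's code)
orders_df = pd.DataFrame(d)
plz_df = pd.DataFrame(plz['data'])

# the zip code is a string in the orders file but an integer in plz.js,
# so align the types before merging
orders_df['billing_address_zip_code'] = orders_df['billing_address_zip_code'].astype(int)

joined = orders_df.merge(plz_df, left_on='billing_address_zip_code', right_on='PLZ', how='left')

# for example, count orders per zip code for plotting
print(joined.groupby('PLZ').size().sort_values(ascending=False).head())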
Related
I am pretty new to Python and am busy with a bootcamp; one of the tasks I have to complete has me a bit stumped. They give as input a txt file that looks like the following:
min:1,2,3,4,5,6
max:1,2,3,4,5,6
avg:1,2,3,4,5,6
The task is to open the txt file in my program and then work out the min, max and avg of each line. I could do this the long way with .readlines(), but they want it done generically so that the number of lines doesn't matter. They want me to read through the lines with a loop, check the first word, and have that word determine the operation.
I hope that I have put the question across correctly.
Regards
While your question wasn't entirely clear about how you are meant to use readlines, maybe this is what you were looking for.
f=open("store.txt","r")
for i in f.readlines():
func,data=i.split(":")
data=[int(j) for j in data.rstrip('\n').split(",")]
print(func,end=":")
if(func=="max"):
print(max(data))
elif(func=="min"):
print(min(data))
else:
print(sum(data)/len(data))
Next time, please try to show your work and ask about specific errors, i.e. not how to solve a problem but rather how to change your solution to fix the problem you are facing.
eval() may be useful here.
The name of the math operation to perform is conveniently the first word of each line in the text file, and some of them are Python built-ins. So after parsing the file into a math expression, I found it tempting to just use Python's eval function to perform the operations on the list of numbers.
Note: this is a one-off solution as use of eval is discouraged on unknown data, but safe here as we manage the input data.
avg is not a built-in operation, so we can define it (and any other operations that are not built-ins) with a lambda.
with open('input.txt', 'r') as f:
    data = f.readlines()

clean = [d.strip('\n').split(':') for d in data]
lines = []

# define operations in input file that are not built-in functions
avg = lambda x: sum(x) / float(len(x))  # float for accurate calculation result

for i in clean:
    lines.append([i[0], list(map(int, i[1].split(',')))])

for expr in lines:
    info = '{}({})'.format(str(expr[0]), str(expr[1]))
    print('{} = {}'.format(info, eval('{op}({d})'.format(op=expr[0], d=expr[1]))))
output:
min([1, 2, 3, 4, 5, 6]) = 1
max([1, 2, 3, 4, 5, 6]) = 6
avg([1, 2, 3, 4, 5, 6]) = 3.5
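If you'd rather avoid eval altogether, the same idea can be expressed with a plain dispatch dictionary mapping each keyword to a callable. A minimal sketch (my own variant, not the answer's code), assuming the same input.txt format:

# eval-free variant: map the first word of each line to a function
operations = {
    'min': min,
    'max': max,
    'avg': lambda x: sum(x) / float(len(x)),
}

with open('input.txt', 'r') as f:
    for line in f:
        name, _, numbers = line.strip().partition(':')
        values = [int(n) for n in numbers.split(',')]
        print('{}({}) = {}'.format(name, values, operations[name](values)))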
Summary: I am searching for misspellings in a bunch of data and it is taking forever.
I am iterating through a few CSV files (a million lines total?), and for each one I am iterating through a JSON sub-value that has maybe 200 strings to search for. For each loop over a JSON value, I add a column to the dataframe, then use a lambda with the Levenshtein distance algorithm to find misspellings. I then output any row that contains a potential misspelling.
code:
for file in file_list:  # 20+ files
    df = pd.read_csv(file, usecols=["search column", "a bunch of other columns..."])  # 50k lines each-ish
    for v in json_data.values():  # 30 ish json values
        for row in v["json_search_string"]:  # 200 ish substrings
            df_temp = df
            df_temp['jelly'] = row
            df_temp['difference'] = df_temp.apply(lambda x: jellyfish.levenshtein_distance(x['search column'], x['jelly']), axis=1)
            df_agg = df_temp[df_temp['difference'] < 3]
            if os.path.isfile(filepath + "levenshtein.csv"):
                with open(filepath + "levenshtein.csv", 'a') as f:
                    df_agg.to_csv(f, header=False)
            else:
                df_agg.to_csv(filtered_filepath + "levenshtein.csv")
I've tried the same algorithm before, but to keep it short, instead of iterating through all JSON values for each CSV, I just did a single JSON value, like this:
for file in file_list:  # 20+ files
    df = pd.read_csv(file, usecols=["search column", "a bunch of other columns..."])  # 50k lines each-ish
    for row in data['z']['json_search_string']:
        # levenshtein algorithm above
The above loop took about 100 minutes to run! (Edit: the lambda call takes about 1-3 seconds each time.) And there are about 30 of them in the JSON file. Any ideas on how I can condense the algorithm and make it faster? I've thought maybe I could take all 200-ish JSON substrings, add each as a column to each df, and somehow run a lambda that searches all columns at once, but I am not sure how to do that yet. That way I would only iterate over the 20 files 30 times each, as opposed to the however many thousand iterations that the third for loop adds. Thoughts?
Notes:
Here is an example of what the data might look like:
JSON data
{
    "A": {
        "email": "blah",
        "name": "Joe Blah",
        "json_search_string": [
            "Company A",
            "Some random company name",
            "Company B",
            "etc",
            "..."
And the csv columns:
ID, Search Column, Other Columns
1, Clompany A, XYZ
2, Company A, XYZ
3, Some misspelled company, XYZ
etc
Well, it is really hard to answer a performance-enhancement question.
Depending on the effort and performance, here are some suggestions.
Small tweak by rearranging your code logic. Effort: small. Expected enhancement: small. Going through your code, I see that you are comparing words from the ~20 files against a single, fixed JSON file. Instead of looping over the JSON file for every CSV file, why not first prepare the fixed word list from the JSON file once, and use it for all the following comparisons? The logic is like:
# prepare the fixed words from the JSON data
fixed_words = []
for v in json_data.values():
    fixed_words += v["json_search_string"]

# loop over each file and compare it against the words in fixed_words
for f in file_list:
    ...  # do the comparison and save
Using multiprocessing. Effort: small. Expected enhancement: medium. Since all of your work items are similar, why not try multiprocessing? You could apply multiprocessing per file OR inside dataframe.apply. There are lots of resources on multiprocessing; please have a look. It is easy to implement for your case, as sketched below.
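A minimal multiprocessing sketch (my own illustration, not the poster's code); process_file is a hypothetical wrapper around the per-file read_csv + comparison + save logic from the question:

from multiprocessing import Pool

def process_file(path):
    # hypothetical worker: read the CSV at `path`, compare its search column
    # against fixed_words, and write the matches out
    pass

if __name__ == '__main__':
    file_list = [...]  # the same 20+ CSV paths as in the question
    with Pool(processes=4) as pool:
        pool.map(process_file, file_list)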
Using another language to implement the Levenshtein distance. The bottleneck of your code is the computation of the Levenshtein distance. You used the jellyfish Python package, which is a pure-Python implementation (so, of course, performance is not good for a large data set). Here are some other options:
a. Use an existing Python package with a C/C++ implementation. Effort: small. Expected enhancement: high. Thanks to the comment from @Corley Brigman, editdistance is one option you can use (see the short sketch after this list).
b. Self-implementation with Cython. Effort: medium. Enhancement: medium or high. Check the pandas documentation on enhancing performance.
c. Self-implementation in C/C++ with a wrapper. Effort: high; expected enhancement: high. Check the documentation on wrapping with C/C++.
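As a rough illustration of option (a), swapping the distance call is essentially a one-line change; a hedged sketch assuming the editdistance package is installed and reusing the df and row names from the question's code:

import editdistance

# drop-in replacement for jellyfish.levenshtein_distance; editdistance is
# backed by a native implementation, so the per-call cost is much lower
mask = [editdistance.eval(s, row) < 3 for s in df['search column']]
df_agg = df[mask]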
You could use several of my suggestions together to gain higher performance.
Hope this would be helpful.
You could change your code to:
for file in file_list:  # 20+ files
    df = pd.read_csv(file, usecols=["search column", "a bunch of other columns..."])  # 50k lines each-ish
    x_search = df['search column']
    for v in json_data.values():  # 30 ish json values
        for row in v["json_search_string"]:  # 200 ish substrings
            mask = [jellyfish.levenshtein_distance(s1, row) < 3 for s1 in x_search]
            df_agg = df[mask]
            if os.path.isfile(filepath + "levenshtein.csv"):
                with open(filepath + "levenshtein.csv", 'a') as f:
                    df_agg.to_csv(f, header=False)
            else:
                df_agg.to_csv(filtered_filepath + "levenshtein.csv")
apply returns a copy of a Series, which can be more expensive:
a = range(10**4)
b = range(10**4,2*(10**4))
%timeit [ (x*y) <3 for x,y in zip(a,b)]
%timeit pd.DataFrame([a,b]).apply(lambda x: x[0]*x[1] < 3 )
1000 loops, best of 3: 1.23 ms per loop
1 loop, best of 3: 668 ms per loop
I'm converting python2.7 scripts to python3.
2to3 makes these kinds of suggestions:
result = result.split(',')
syslog_trace("Result : {0}".format(result), False, DEBUG)
-   data.append(map(float, result))
+   data.append(list(map(float, result)))
if (len(data) > samples):
    data.pop(0)
syslog_trace("Data : {0}".format(data), False, DEBUG)
# report sample average
if (startTime % reportTime < sampleTime):
-   somma = map(sum, zip(*data))
+   somma = list(map(sum, list(zip(*data))))
    # not all entries should be float
    # 0.37, 0.18, 0.17, 4, 143, 32147, 3, 4, 93, 0, 0
    averages = [format(sm / len(data), '.3f') for sm in somma]
I'm sure the makers of Python 3 did not want it to be done like that. At least, it gives me a "you must be kidding" feeling.
Is there a more pythonic way of doing this?
What's wrong with the unfixed somma?
2to3 cannot know how somma is going to be used. In this case, consumed as a generator in the next line to compute averages, it is fine and optimal; there is no need to convert it to a list.
That's the genius of the Python 3 changes that made map and zip return generators: most people used those lists as generators already, wasting precious memory by materializing lists they did not need.
# report sample average
if (startTime % reportTime < sampleTime):
    somma = map(sum, zip(*data))
    # not all entries should be float
    # 0.37, 0.18, 0.17, 4, 143, 32147, 3, 4, 93, 0, 0
    averages = [format(sm / len(data), '.3f') for sm in somma]
Of course the first statement, left unconverted, will fail, since we would be appending a generator where we need a list. In that case, the error is quickly fixed.
If left like this: data.append(map(float, result)), the next trace shows something fishy: 'Data : [<map object at 0x00000000043DB6A0>]', which you can quickly fix by converting to a list as 2to3 suggested.
2to3 does its best to create running Python 3 code, but it does not replace manual rework or produce optimal code. When you are in a hurry you can apply it, but always check the diffs against the old code, like the OP did.
The -3 option of the latest Python 2 versions prints warnings wherever an error would be raised under Python 3. It's another approach, better suited when you have more time to perform your migration.
I'm sure the makers of Python3 did not want to do it like that
Well, the makers of Python generally don't like seeing Python 2 being used; I've seen that sentiment expressed at pretty much every recent PyCon.
Is there a more pythonic way of doing this?
That really depends on your interpretation of Pythonic. List comprehensions seem more intuitive in your case: you want to construct a list, so there's no need to create an iterator with map or zip and then feed it to list().
Now, why 2to3 chose list() wrapping instead of comprehensions, I do not know; probably because it is the easiest transformation to implement mechanically.
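For example, the two lines 2to3 converted in the question could be written with comprehensions instead of list(map(...)); a small sketch of the idea, not tested against the full script:

# instead of data.append(list(map(float, result)))
data.append([float(x) for x in result])

# instead of somma = list(map(sum, list(zip(*data))))
somma = [sum(column) for column in zip(*data)]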
I have a file with ID numbers along with the specifics of each logged event (time, temp, location).
I want Python to group all records with the same ID into their own unique files, storing all of the event specifics from each record.
That is to say: go through each record; if the ID does not have a log file yet, create one; if it does, log the new record into that ID's file.
Example input
1, 11:00, 70, port A
1, 11:02, 70, port B
2, 11:00, 40, blink
3, 11:00, 30, front
Desired output
file name "1" with :[11:00, 70, port A ;
11:02, 70, port B ]
file name "2" with :[11:00, 40, blink]
file name "3" with :[11:00, 30, front]
I am very new to Python and I am having trouble finding a reference guide. If anyone knows a good place where I can look for an answer, I would appreciate it.
This is pretty straightforward, presuming your file is as you described it. However, I am on Python 2.7, so I am not sure what the differences could be.
from collections import defaultdict

all_lines = open('c:\\path_to_my_file.txt').readlines()
my_classified_lines = defaultdict(list)
for line in all_lines:
    data_type, value1, value2, value3 = line.split(',')
    my_classified_lines[data_type].append(','.join([value1, value2, value3]))

for data_type in my_classified_lines:
    outref = open('c:\\directory\\' + data_type + '.txt', 'w')
    outref.writelines(my_classified_lines[data_type])
    outref.close()
To understand this, you need to learn about dictionaries (useful containers for data), file operations, and loops.
I found Dive Into Python a great resource when I was starting out.
I may not have exactly what you want; looking at your desired output, mine would be like:
11:00, 70, port A
11:02, 70, port B
I think you are saying that you want a list-like object with semicolons as the separator, which to me suggests you are asking for a string with brackets around it. If your output really needs to be as you describe it, then try this:
for data_type in my_classified_lines:
    outref = open('c:\\directory\\' + data_type + '.txt', 'w')
    out_list = [';'.join(my_classified_lines[data_type])]
    outref.writelines(out_list)  # should be only one line
    outref.close()
Well, my boss asked me to do this:
check and find the revision of the initialization handle for each variation,
but there are thousands of variations.
I can use client.diff now, but how can I get all versions of one file?
Do you need the list of revision numbers? If so, use the Client.log() function. It will return a list of all revisions of the specified file.
In [48]: url='http://svn.apache.org/repos/asf/httpd/httpd/trunk/README'
In [49]: log = c.log(url)
In [50]: [x.revision.number for x in log]
Out[50]:
[1209505,
1209499,
1150179,
1129808,
739831,
494716,
490083,
350202,
106103,
97800,
94766,
94066,
92186,
91989,
90516,
90357,
90149,
89992,
87515,
87481,
87473,
87470]
Or, if you need the actual contents of each of the revisions of the file, try this:
all_versions = { x.revision.number : c.cat(url, x.revision) for x in c.log(url) }
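As a small follow-up sketch built on the same pysvn calls used above (Client.log() and Client.cat()), you could, for example, walk every revision of the file and report how large it was at each one:

import pysvn

c = pysvn.Client()
url = 'http://svn.apache.org/repos/asf/httpd/httpd/trunk/README'

# print the size of the file at every revision in its history
for entry in c.log(url):
    content = c.cat(url, entry.revision)
    print(entry.revision.number, len(content), 'bytes')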