Pyspark/python chinese characters not recognized - python

Sorry I know this question has been asked many times, but I really can't find a solution that could solve my problem.
I am using pyspark module in python to read a file:
data = sc.textFile("data/text_data.csv")
After some data cleaning, I get two columns, both of which contain Chinese characters. However, the first three records look like the output below, where the elements of the first tuple are the column names.
[('aybh_zw', 'jyaq'),
('������', '�ڶ��綫·�\U000ffd7c�\u0530�ſ�������������'),
('030', 'FF5E84D38B5B48CF97F26B5E6DAB4DD8')]
So I did this transformation next:
second_cleaned_data = first_cleaned_data.map(lambda s: (s[0].encode('UTF-8'), s[1].encode('UTF-8')))
However, data becomes below:
[(b'aybh_zw', b'jyaq'),
(b'\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd',
(b'030', b'FF5E84D38B5B48CF97F26B5E6DAB4DD8')]
I omitted the second element of the second tuple because of some Stack Overflow formatting issues.
Since this is still incorrect, I then did the following and got:
[('aybh_zw', 'jyaq'),
('锟斤拷锟斤拷锟斤拷', '锟节讹拷锟界东路锟襟康硷拷园锟脚匡拷锟'),
('030', 'FF5E84D38B5B48CF97F26B5E6DAB4DD8')]
Still, this is incorrect since those characters are not what they should be.
So could anyone help me with this? I really don't know what to do.
Thank you
A sample of csv file is below:
recordkey ajbh ajjf ajlx ajlx_zw ajly ajly_zw ajmc ajzt
QTIwMTUwNjAwMDFfMzcxNDAwMDE A2015060001 0 2 刑事 1 110指令 张俊杰被盗窃案 202 已立案 212000002 盗窃罪 盗窃罪 371499 经济 2.02E+13 东风路电业局宿舍 6/13/15 7:08 19B2569194BB4471E0530390300A15A6
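The '锟斤拷' output above is what typically appears when bytes written in GBK are decoded as UTF-8 and the resulting replacement characters are then read back as GBK. The sketch below reproduces that chain in plain Python. It assumes the csv file was saved as GBK, which the question never states, so treat it as a hypothesis to check rather than a confirmed diagnosis.

```python
# Assumption: the CSV was saved as GBK, not UTF-8. The question does not confirm this.
original = "刑事"
raw_bytes = original.encode("gbk")  # stands in for the bytes stored on disk

# Step 1: decoding GBK bytes as UTF-8 turns the invalid bytes into U+FFFD.
misread = raw_bytes.decode("utf-8", errors="replace")

# Step 2: encoding U+FFFD as UTF-8 and reading those bytes as GBK
# gives the familiar garbage pattern.
garbage = "\ufffd\ufffd".encode("utf-8").decode("gbk")

# Decoding the raw bytes with the encoding they were written in recovers the text.
recovered = raw_bytes.decode("gbk")

print(repr(misread), garbage, recovered)
```

Once the text has been decoded with the wrong codec the original bytes are lost, so the decoding has to happen on the raw bytes. In PySpark that might mean reading with sc.textFile("data/text_data.csv", use_unicode=False) and then mapping each line through .decode('gbk'); this is untested here and depends on the file really being GBK.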

Related

Iterating over array and slicing or making changes in Python

I'm about to pull my hair out on this. I'm not sure why the index in my array is not being implemented in the second column.
I created this array - project_information :
project_information.append([proj_id,project_text])
When I print this out, I get the rows and columns. It contains about 40 rows.
When I iterate through it to print out the contents, everything comes out fine. I am using this:
for i in range(0,len(project_information)):
    project_id = project_information[i][0]
    project_text = project_information[i][1]
    print(project_id)
    print(project_text)
The project_text column contains text, while the project_id column contains integers. It prints out perfectly, and the index changes for both project_id and project_text.
However, I need to use the project_text in a different way, and I am really struggling with this. I need to slice the text to a shorter text for reuse. To do this, I tried:
for i in range(0,len(project_information)):
    project_id = project_information[i][0]
    project_text = project_information[i][1]
    print(project_id)
    print(project_text)
    if len(project_text) > 5000:
        trunc_proj_text = project_text[:1000]
    else:
        trunc_proj_text = project_text
    print(project_id)
    print(trunc_proj_text)
The problem I'm having here is that though the project_id column is being iterated through properly, the project_text is not. What I am getting is just the text in the first row for the project_text, sliced, and repeated for as many times as the length of the array.
I have tried different ways, and also a while loop, but it is still not working.
I've also looked at these answers for reference - Slicing,indexing and iterating over 2D Numpy arrays,Efficient iteration over slice in Python, iteration over list slices, and I can't seem to see how they can be applied to my problem.
I'm not well-versed in using Numpy, so is this something that it could help with? I'm well aware this might be simple and I'm missing it because I've been working on various aspects of this project for the past weeks, so I would appreciate a bit of consideration in this.
Thanks in advance.
The problem was with the input list here, so the slicing with this code does in fact work. The code to create the input array has now been fixed. The original code to create the input list was concatenating the strings for each entry, so the project_texts for each appeared different from the end, but all had the same beginning. But viewing this on a console, it was hard to see.
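For reference, the same truncation can be written without index arithmetic by unpacking each row directly. This is only a sketch: the function name truncate_texts, the parameters limit and keep, and the sample rows are made up for illustration, while the 5000 and 1000 thresholds come from the question.

```python
def truncate_texts(project_information, limit=5000, keep=1000):
    """Return (project_id, text) pairs, shortening any text longer than limit."""
    truncated = []
    for project_id, project_text in project_information:
        if len(project_text) > limit:
            project_text = project_text[:keep]
        truncated.append((project_id, project_text))
    return truncated


# Hypothetical sample rows standing in for the real project data.
sample = [[1, "a" * 6000], [2, "short text"]]
result = truncate_texts(sample)
print([(project_id, len(text)) for project_id, text in result])
```

Printing the lengths instead of the full texts also makes it easier to spot inputs that only look different at the end, which was the actual cause here.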

Python Generic Data Engine

I have been working on Python for about 1.5yrs and looking for some direction. This is the first time I can't find what I need after doing a lot of searching and must be missing something- most likely searching the wrong terms.
Problem: I am working on an app that has many processes (Could be hundreds or even thousands). Each process may have a unique input and output data format - could be multiline strings, comma separated strings, excel or csv with or without varying headers and many others. I need something that will format the input correctly and handle the output based upon the process. New processes also need to be easily added/defined. I am open to whatever is the best approach, but my thoughts are to use a database that stores the template/data definition and use that to know the format given a process. However, I'm struggling to come up with exactly how, if this is really the best approach, but it needs to be a solution that is scalable. Any direction would be appreciated. Thank you.
A couple simple examples of data
Process 1 example data (multi line string with Header)
Input of
[ABC123, XYZ453, CDE987]
and the resulting data input below would be created:
Barcode
ABC123
XYZ453
CDE987
This code below works, but is not reusable for the example 2.
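# renamed from list/input so the built-ins are not shadowed
barcodes = ['ABC123', 'XYZ453', 'CDE987']
input_data = "Barcode\r\n"
for barcode in barcodes:
    input_data = input_data + barcode + '\r\n'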
Process 2 example input template (comma separated with Header):
Barcode,Location,Param1,Param2
Item1,L1,11,A
Item1,L1,22,B
Item2,L1,33,C
Item2,L2,44,F
Item3,L2,55,B
Item3,L2,66,P
Process 2 example resulting input data (comma separated with Header):
Input of
{'Barcode':['ABC123', 'XYZ453', 'CDE987', 'FGH487', 'YTR123'], 'Location':['Shelf1', 'Shelf2']}
and using the template to create the input data below:
Barcode,Location,Param1,Param2
ABC123,Shelf1,11,A
ABC123,Shelf1,22,B
XYZ453,Shelf1,33,C
XYZ453,Shelf2,44,F
CDE987,Shelf2,55,B
CDE987,Shelf2,66,P
FGH487,Shelf1,11,A
FGH487,Shelf1,22,B
YTR123,Shelf1,33,C
YTR123,Shelf2,44,F
I know how to handle each process with a hardcoded loop, dataframe merge, etc. I've done some abstraction in other cases with dicts. However, how to define and store formats that vary so much, and how to create reusable abstracted code around them, is where I am stuck.
Maybe each function could return its output as a dict with the keys "datatype" and "output", where "output" holds the actual output.
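One possible shape for the "store the format definition and look it up per process" idea is a small registry of format specs driving a single generic writer. The sketch below is only an illustration: PROCESS_FORMATS, build_input and the spec keys are invented names, the specs are hardcoded where a real system might load them from a database, and it does not attempt the template expansion shown for Process 2.

```python
import csv
import io

# Hypothetical format definitions; in practice these could be rows in a database.
PROCESS_FORMATS = {
    "process_1": {"header": ["Barcode"], "delimiter": ",", "newline": "\r\n"},
    "process_2": {"header": ["Barcode", "Location"], "delimiter": ",", "newline": "\r\n"},
}


def build_input(process_name, rows):
    """Render rows as text using the format registered for process_name."""
    spec = PROCESS_FORMATS[process_name]
    buffer = io.StringIO()
    writer = csv.writer(
        buffer, delimiter=spec["delimiter"], lineterminator=spec["newline"]
    )
    writer.writerow(spec["header"])
    writer.writerows(rows)
    return buffer.getvalue()


process_1_input = build_input("process_1", [["ABC123"], ["XYZ453"], ["CDE987"]])
process_2_input = build_input("process_2", [["ABC123", "Shelf1"]])
print(process_1_input)
```

Adding a new process then means adding one spec entry rather than writing a new loop, as long as its format fits what the generic writer can express.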

Read data from CSV and write data to CSV - String to integer

I have a CSV file with 100,000 rows.
Each row in column A is a sentence comprised of both chars and integers.
I want column B to contain only integers.
I want the new columns to be in the same CSV file.
How can I accomplish this?
If I'm understanding your question correctly, I would use .isdigit() to parse the data in column A. I'm frankly not sure what the format of column A is, so I don't know exactly what you would do with this (if you gave more information I could give a more specific answer). Your solution will likely come in a similar form to this:
def find(lines):
    B = []
    for line in lines:
        numbers = [c for c in line if c.isdigit()]
        current = int(''.join(numbers))
        # current is the concatenation of all
        # integers found in column A from left to right
        B.append(current)
    return B
Let me know if this makes sense or is even in the right track for your solution. Once again, without knowing what you're trying to do, and what A looks like, I'm not sure what your actual goals are.
EDIT
I'm not going to explain the csv stuff for you, mainly because Python already includes a fantastic library for it, the csv module, and it is well documented. If you have specific questions related to writing csv, definitely post them.
It sounds like you essentially want to pull int values out of column A then add them to a new column B. There are definitely many ways to solve this, but the general form of the problem is for each row you'll filter out the int, then you'll add the filtered int into the new column. I'll list a couple:
Regex: You could use a pattern such as [0-9]+ to pull the digit strings out of A, then use int() on each match to cast it to an integer, then store those values in B. I'm a sucker for a good regular expression and this one is fairly straightforward. Regexr is a great resource to learn about this and test your pattern.
Use an algorithm similar to the one above: That algorithm worked before, but I've updated it slightly. It now returns an array of numbers corresponding to the numbers in A from left to right. This is relatively sound, but it doesn't guarantee you get the right integer: if the title has an int in it, its digits get concatenated with the others. It is likely one of the clearer ways of doing this, though.
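A rough sketch of the regex approach combined with the csv module might look like this. The function names are made up, the sample rows are invented, and joining several integers with spaces in column B is just one guess at the desired layout since the question does not say what to do when a sentence contains more than one number.

```python
import csv
import io
import re


def extract_integers(text):
    """Return every run of digits in text as an int, left to right."""
    return [int(match) for match in re.findall(r"[0-9]+", text)]


def add_integer_column(source, target):
    """Copy column A from source and write the integers found in it to column B."""
    writer = csv.writer(target)
    for row in csv.reader(source):
        if not row:
            continue  # skip blank lines
        found = extract_integers(row[0])
        writer.writerow([row[0], " ".join(str(n) for n in found)])


# In-memory stand-ins for the real input and output files.
source = io.StringIO("order 12 has 3 items\nno digits here\n")
target = io.StringIO()
add_integer_column(source, target)
result = target.getvalue()
print(result)
```

To end up with the new column in the same file, one option is to write to a temporary file and replace the original once the whole file has been processed.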

How does unicodecsv.DictReader represent a csv file

I'm currently going through the Udacity course on data analysis in python, and we've been using the unicodecsv library.
More specifically we've written the following code which reads a csv file and converts it into a list. Here is the code:
import unicodecsv

def read_csv(filename):
    # unicodecsv expects the file to be opened in binary mode
    with open(filename, 'rb') as f:
        reader = unicodecsv.DictReader(f)
        return list(reader)
In order to get my head around this, I'm trying to figure out how the data is represented in the dictionary and the list, and I'm very confused. Can someone please explain it to me.
For example, one thing I don't understand is why the following throws an error:
enrollments['cancel_date']
While the following works fine:
for enrollment in enrollments:
    enrollment['cancel_date'] = parse_date(enrollment['cancel_date'])
Hopefully this question makes sense. I'm just having trouble visualizing how all of this is represented.
Any help would be appreciated.
Thanks.
I also ended up here because of some trouble with the course and found this unanswered. You have probably worked it out already, but I am answering anyway in case someone else finds it helpful.
Like we all know, dictionaries can be accessed like
dictionary_name['key']
and likewise
enrollments['cancel_date'] should also work.
But if you do something like
print enrollments
you will see the structure
[{u'status': u'canceled', u'is_udacity': u'True', ...}, {}, ... {}]
If you notice the brackets, it's like a list of dictionaries. You may argue it is a list of list. Try it.
print enrollments[0][0]
You'll get an error! KeyError.
So it is a collection of dictionaries. To access one, index into the list with enrollments[n], which gives you a single dictionary (one row of the csv).
Now that you have a dictionary, you can use its keys freely.
print enrollments[0]['cancel_date']
Now coming to your loop,
for enrollment in enrollments:
    enrollment['cancel_date'] = parse_date(enrollment['cancel_date'])
Here enrollment is the loop variable: on each pass it holds one element of the list enrollments, that is enrollments[0], enrollments[1], and so on up to the last row.
So on every iteration enrollment is a dictionary taken from enrollments, which is why enrollment['cancel_date'] works while enrollments['cancel_date'] does not.
Lastly I want to add a little more thing which is why I came to the thread.
What is the meaning of "u" in u'..'? Ex: u'cancel_date': u'11-02-19'.
The answer is that it marks the string as a Unicode string. The u is not part of the string; it is Python 2 notation for a unicode literal. Unicode is a standard that assigns a code point to the characters and symbols of the world's writing systems.
This mainly happens because the unicodecsv package does not take the headache of tracking and converting each item in the csv file. It reads them as Unicode to preserve all characters. Now that's why Caroline and you defined and used parse_date() and other functions to convert the Unicode strings to the desired datatype. This is all a part of the Data Wrangling process.
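The list-of-dictionaries structure can be seen in a small self-contained sketch. It uses the standard library's csv.DictReader on Python 3 in place of unicodecsv, and the sample rows are invented, so the details differ slightly from the course code (for instance there is no u prefix in Python 3).

```python
import csv
import io

# Invented sample data standing in for the course's enrollments file.
sample = io.StringIO(
    "account_key,status,cancel_date\n"
    "448,canceled,2015-01-14\n"
    "700,current,\n"
)

# list(reader) gives one dictionary per csv row, keyed by the header names.
enrollments = list(csv.DictReader(sample))

# Index the list first, then use the column name as the dictionary key.
first_cancel_date = enrollments[0]["cancel_date"]

# Using a column name directly on the list fails, because a list has no keys.
try:
    enrollments["cancel_date"]
except TypeError:
    list_lookup_failed = True
else:
    list_lookup_failed = False

print(first_cancel_date, list_lookup_failed)
```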

Creating dictionary from xlsx: TypeError: argument of type 'Book' is not iterable

I'm pretty new to Python (and the xlrd module), so my code is probably not nearly as compact as it could be. I'm just using it to analyse some data, so it's more important for me to get what I'm doing rather than for me to make the code as compact as possible (though I do hope to improve, so feel free to give me advice on the coding itself, provided you manage to explain it to a 'newbie' :p )
That being said, here's my issue:
Context
I have an xlsx file with data on errors that people made when translating a text. The first column contains a code for the error relative to the text (conceptual errors), the second column contains a code for the translator that made the error. I want to create a dictionary in which the keys are the conceptual error codes, and the values are lists of the different translators that made that conceptual error.
A short fragment from the xlsx (to give you an idea of the codes in the two columns):
1722_Z1_CF5 1722_HT_EV_Z1_F1
1722_Z1_CF1 1722_PE_AL_Z1_F1
1722_Z1_CF9 1722_PE_EVC_Z1_F1
1722_Z1_CF5 1722_PE_LH_Z1_F1
As you can see, the conceptual error '1722_Z1_CF5' has been made by 2 different people ('1722_HT_EV_Z1_F1' and '1722_PE_LH_Z1_F1'). The dictionary for this fragment would look something like:
1722_Z1_CF5: 1722_HT_EV_Z1_F1, 1722_PE_LH_Z1_F1
1722_Z1_CF1: 1722_PE_AL_Z1_F1
1722_Z1_CF9: 1722_PE_EVC_Z1_F1
Code
The code below is what I tried to do to create the dictionary.
def TranslatorsPerError(sheet):
    TotalConceptualErrors(sheet)
    TranslatorsPerError = {}
    for row_index in range(sheet.nrows):
        if sheet.cell(row_index,0).value in ConceptualErrors and sheet.cell(row_index,0).value not in TranslatorsPerError:
            TranslatorsPerError[str(sheet.cell(row_index,0).value)]=[str(sheet.cell(row_index,1).value),]
        if sheet.cell(row_index,0).value in ConceptualErrors and sheet.cell(row_index,0).value in TranslatorsPerError:
            TranslatorsPerError[str(sheet.cell(row_index,0).value)].append(str(sheet.cell(row_index,1).value))
    return TranslatorsPerError
'TotalConceptualErrors' is a function I created that returns a list ('ConceptualErrors') of the conceptual error codes from the first column without duplicates (and it filters out some other information that was also present in the first column, that's why I needed to use this one first).
Problem
The problem is that this function keeps giving me an error: TypeError: argument of type 'Book' is not iterable
I know that problems with iterables can sometimes be solved by casting certain things into a different type, but I'm not sure how I should solve this one. I tried to use 'str()' for different elements, but that didn't solve the problem. Maybe it has something to do with my code, maybe with the nature of dictionaries or xlrd... (looking at the type 'book', my guess would be on the latter).
Any help or feedback on how to fix this would be greatly appreciated. If you need extra information to understand what's going on or what I'm looking for, please ask.
where is ConceptualErrors being set?
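Following up on that comment, one way to avoid depending on a global ConceptualErrors is to pass the list of valid codes into the function explicitly. The sketch below works on plain (error code, translator code) pairs so that it runs without xlrd; the function name and parameters are made up, and with xlrd the pairs would presumably come from the sheet.cell(row_index, 0).value and sheet.cell(row_index, 1).value calls used in the question.

```python
from collections import defaultdict


def translators_per_error(rows, conceptual_errors):
    """Group translator codes by conceptual error code, without duplicates."""
    grouped = defaultdict(list)
    for error_code, translator_code in rows:
        if error_code not in conceptual_errors:
            continue  # skip rows that are not conceptual errors
        translators = grouped[error_code]
        if translator_code not in translators:
            translators.append(translator_code)
    return dict(grouped)


# The fragment from the question, as (error code, translator code) pairs.
rows = [
    ("1722_Z1_CF5", "1722_HT_EV_Z1_F1"),
    ("1722_Z1_CF1", "1722_PE_AL_Z1_F1"),
    ("1722_Z1_CF9", "1722_PE_EVC_Z1_F1"),
    ("1722_Z1_CF5", "1722_PE_LH_Z1_F1"),
]
conceptual_errors = ["1722_Z1_CF5", "1722_Z1_CF1", "1722_Z1_CF9"]
result = translators_per_error(rows, conceptual_errors)
print(result)
```

Using a separate name for the function and for the dictionary it builds also avoids the function shadowing itself, as TranslatorsPerError does in the question.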
