I'm trying to convert a pandas DataFrame into a list. The conversion works, but I have some issues with the encoding, and I hope someone can advise me on how to handle this problem. I'm using Python 2.7.
I'm loading an Excel file, and it loads correctly. I'm using the following code and get the following output:
import os
import pandas as pd

germanStatesExcelFile = 'German_States.xlsx'
ePath_german_states = os.path.dirname(__file__) + '/' + germanStatesExcelFile
german_states = pd.read_excel(ePath_german_states)
print("doc " + str(german_states))
Output:
states
0 baden-württemberg
1 bayern
2 hessen
3 rheinland-pfalz
4 saarland
5 nordrhein-westfalen
The next step is converting this DataFrame into a list, which I do with the following code:
german_states = german_states['states'].tolist()
Output:
[u'baden-w\xfcrttemberg', u'bayern', u'hessen', u'rheinland-pfalz', u'saarland', u'nordrhein-westfalen']
It seems like the list isn't handling the UTF-8 encoding correctly, so I tried the following step:
german_states = [x.encode('utf-8') for x in german_states]
Output:
['baden-w\xc3\xbcrttemberg', 'bayern', 'hessen', 'rheinland-pfalz', 'saarland', 'nordrhein-westfalen']
I would like to have the following output:
['baden-württemberg', 'bayern', 'hessen', 'rheinland-pfalz', 'saarland', 'nordrhein-westfalen']
Little late to the party, but if encoding to UTF-8 as below doesn't work, you could use the unicodedata.normalize function.
german_states_decoded = [x.encode('utf-8') for x in german_states]
You could also try Python's built-in str, as below. Note that this only produces the output shown on Python 3, where str is already Unicode; on Python 2, str() raises a UnicodeEncodeError for non-ASCII characters such as the ü in your data.
Otherwise, there are a number of good answers to a similar question.
german_states = [u'baden-w\xfcrttemberg', u'bayern', u'hessen', u'rheinland-pfalz', u'saarland', u'nordrhein-westfalen']
german_states = list(map(str, german_states))
# ['baden-württemberg', 'bayern', 'hessen', 'rheinland-pfalz', 'saarland', 'nordrhein-westfalen']
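Worth noting: the u'...\xfc' escapes in the question are only how the list's repr displays the strings; \xfc IS the character ü, so the data is already correct. A small sketch (Python 3 syntax; the same holds for Python 2 unicode strings when printed individually):

```python
german_states = [u'baden-w\xfcrttemberg', u'bayern', u'hessen']

# \xfc in the list repr is just the escaped form of ü;
# printing an element directly shows the umlaut
print(german_states[0])

assert german_states[0] == u'baden-württemberg'
```

So for many use cases no re-encoding is needed at all; the escapes disappear as soon as you print or write the individual strings.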
I'm trying something very simple in Python.
zips = sempmme['Zip code'].unique()
I want to apply zipcode.isequal('12345') to each element of zips, but I'm not sure how to do it in a Pythonic, efficient way. I tried zipcode.isequal(lambda x: x in zips) and even a for loop, but I can't seem to get it:
for i in range(0, len(zips)):
    # print(zips[i])
    cities[i] = zipcode.isequal("" + zips[i])
It shows 'isequal() can only take string'. Needless to say, this is the first time I'm coding in Python, and I figured the best way to learn is to take on a project and figure it out.
EDIT:
output of repr(zips):
"array([u'25404', u'265056555', u'251772049', u'25177', u'26508', u'25262',\n u'26554', u'265053816', u'154741359', u'15461', u'26250',\n u'262413392', u'25443', u'26505', u'258809366', u'217331141',\n u'26757', u'26201', u'25419', u'25427', u'25401', u'26003',\n u'25428', u'26150', u'268479803', u'24426', u' ', u'25813',\n u'253099769', u'22603', u'25174', u'25984', u'25430', u'25438',\n u'268360008', u'254356541', u'26170', u'25971', u'24622', u'24986',\n u'26847', u'24957', u'25963', u'25064', u'260039425', u'25526',\n u'25523', u'26452', u'25143', u'26301', u'25285', u'26104',\n u'25951', u'25206', u'24740', u'252137436', u'25420', u'26330',\n u'24701', u'25309', u'25304', u'26408', u'25564', u'26753',\n u'15349', u'45767', u'25213', u'25168', u'25302', u'24931',\n u'26623', u'25704', u'26362', u'24966', u'250641730', u'26415',\n u'25130', u'26134', u'25413', u'26101', u'25193', u'26354',\n u'260031309', u'26651', u'24954', u'26180', u'256700145', u'26033',\n u'26444', u'25661', u'26555', u'264521704', u'25111', u'25043',\n u'26278', u'25560', u'25181', u'25854', u'259210233', u'24874',\n u'26181', u'24963', u'254381574', u'25557', u'26203', u'26836',\n u'255109768', u'25035', u'25214', u'26726', u'25132', u'25411',\n u'24853', u'26750', u'25071', u'25913', u'26374', u'25110',\n u'24901', u'25843', u'25880', u'26610', u'26456', u'41514',\n u'26684', u'25541', u'25311', u'26431', u'26241', u'26541',\n u'25162', u'25312', u'24801', u'26159', u'25239', u'255269325',\n u'26293', u'249460055', u'25149', u'26743', u'261871112', u'25315',\n u'25570', u'25123', u'254300341', u'25705', u'25421', u'24747',\n u'261709789', u'26438', u'26448', u'263011836', u'26041', u'25248',\n u'24739', u'25125', u'25510', u'26531', u'251860464', u'263690126',\n u'26205', u'25678', u'251238805', u'25320', u'249707005', u'25414',\n u'26133', u'263850384', u'26501', u'25405', u'25882', u'25244',\n u'25504', u'25635', u'24868', u'26143', u'25313', u'45769',\n u'24870', u'25508', u'26323', 
u'24832', u'25202', u'26451',\n u'25637', u'26288', u'26656', u'25670', u'25550', u'25059',\n u'456197853', u'249011225', u'25303', u'45680', u'26155', u'25002',\n u'25387', u'251771047', u'263230278', u'256250601', u'246051700',\n u'25045', u'25085', u'25011', u'25136', u'26405', u'25241',\n u'26070', u'25075', u'259181310', u'26105', u'25253', u'25275',\n u'24811', u'26287', u'25669', u'25159', u'26833', u'26378',\n u'24850', u'45760', u'26519', u'22802', u'25039', u'25403',\n u'26425', u'25625', u'254254109', u'253099281', u'258821226',\n u'255609701', u'252761627', u'25545', u'26546', u'25674',\n u'255701081', u'25547', u'257021403', u'25555', u'25113',\n u'255609730', u'255089543', u'25909', u'250489721', u'25958',\n u'25831', u'25825', u'25701', u'258479621', u'267630283', u'26588',\n u'24945', u'254280359', u'257029632', u'254253549', u'24869',\n u'25203', u'24847', u'248440000', u'25425', u'24614', u'26807',\n u'253069761', u'28104', u'26525', u'24910', u'25361', u'259813804',\n u'24808', u'253027228', u'26601', u'25801', u'25702', u'26208',\n u'255249621', u'25652', u'25033', u'26416', u'24712', u'25444',\n u'32707', u'259621513', u'25644', u'26034', u'262419617', u'25917',\n u'26062', u'25169', u'24731', u'254434652', u'25314', u'24620',\n u'75092', u'25306', u'26385'], dtype=object)"
Depending on what your goal is in "applying zipcode.isequal for each zips":
To return a list where each element is the return value of zipcode.isequal() of the elements in zips:
cities = [zipcode.isequal(str(z)) for z in zips]
or return a list containing the elements of zips for which zipcode.isequal() returns True:
cities = [z for z in zips if zipcode.isequal(str(z))]
Edit: Given that zips does not consist entirely of numeric strings, you probably need an additional filter in either case:
cities = [zipcode.isequal(str(z)) for z in zips if z.isdigit()]
cities = [z for z in zips if z.isdigit() and zipcode.isequal(str(z))]
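A runnable sketch of that filtering step with invented data (is_target_zip is a stand-in here, since I don't have the real zipcode library's signature at hand; it only illustrates a string-only API like isequal):

```python
def is_target_zip(z):
    # hypothetical stand-in for a zipcode.isequal('...')-style check
    # that accepts only a 5-digit string
    return z == '25404'

zips = [u'25404', u'265056555', u' ', u'25177', u'25404']

# drop blanks and ZIP+4-length entries before calling the string-only API
clean = [z for z in zips if z.strip().isdigit() and len(z) == 5]
matches = [z for z in clean if is_target_zip(z)]
```

The len(z) == 5 check matters for your data because it also contains 9-digit ZIP+4 codes like u'265056555', which isdigit() alone would let through.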
I need to iterate through two files many millions of times,
counting the number of appearances of word pairs throughout the files
(in order to build a contingency table for two words and calculate a Fisher's exact test score).
I'm currently using:
from itertools import izip
src=tuple(open('src.txt','r'))
tgt=tuple(open('tgt.txt','r'))
w1count=0
w2count=0
w1='someword'
w2='anotherword'
for x, y in izip(src, tgt):
    if w1 in x:
        w1count += 1
    if w2 in y:
        w2count += 1
    .....
While this is not bad, I want to know if there is any faster way to iterate through two files, hopefully significantly faster.
I appreciate your help in advance.
I still don't quite get what exactly you are trying to do, but here's some example code that might point you in the right direction.
We can use a dictionary or a collections.Counter instance to count all occurring words and pairs in a single pass through the files. After that, we only need to query the in-memory data.
import collections
import itertools
import re
def find_words(line):
    for match in re.finditer(r"\w+", line):
        yield match.group().lower()
counts1 = collections.Counter()
counts2 = collections.Counter()
counts_pairs = collections.Counter()
with open("src.txt") as f1, open("tgt.txt") as f2:
    for line1, line2 in itertools.izip(f1, f2):
        words1 = list(find_words(line1))
        words2 = list(find_words(line2))
        counts1.update(words1)
        counts2.update(words2)
        counts_pairs.update(itertools.product(words1, words2))
print counts1["someword"]
print counts1["anotherword"]
print counts_pairs["someword", "anotherword"]
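Since the stated goal is a Fisher's exact test, here is a sketch of how such counters can feed a 2x2 contingency table, with invented toy data (Python 3 syntax, so zip instead of izip). Note it counts each word or pair at most once per line pair by using sets, which is what a per-line contingency table needs; that's a slight variation on updating with the full word lists:

```python
import collections
import itertools
import re

def find_words(line):
    return {m.group().lower() for m in re.finditer(r"\w+", line)}

# toy stand-ins for the two parallel files (invented data)
src_lines = ["the cat sat", "a dog ran", "the cat ran"]
tgt_lines = ["le chat dort", "un chien court", "le chat court"]

counts1 = collections.Counter()
counts2 = collections.Counter()
counts_pairs = collections.Counter()
n_lines = 0
for line1, line2 in zip(src_lines, tgt_lines):
    words1, words2 = find_words(line1), find_words(line2)
    counts1.update(words1)    # number of lines containing each src word
    counts2.update(words2)    # number of lines containing each tgt word
    counts_pairs.update(itertools.product(words1, words2))
    n_lines += 1

w1, w2 = "cat", "chat"
a = counts_pairs[w1, w2]    # lines with both w1 and w2
b = counts1[w1] - a         # lines with w1 but not w2
c = counts2[w2] - a         # lines with w2 but not w1
d = n_lines - a - b - c     # lines with neither
table = [[a, b], [c, d]]    # ready for scipy.stats.fisher_exact
```

With the counters built once, querying any word pair is a dictionary lookup, so the millions of tests no longer require re-reading the files.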
In general if your data is small enough to fit into memory then your best bet is to:
Pre-process data into memory
Iterate from memory structures
If the files are large, you may still be able to pre-process them into data structures (such as your zipped data) and save those in a format such as pickle, which is much faster to load and work with; then process them in a separate step.
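A minimal sketch of that pickle round trip, assuming the expensive pass has already produced a Counter of word frequencies (the counts here are invented):

```python
import collections
import os
import pickle
import tempfile

# pre-computed counts (would come from the one-pass scan of the files)
counts = collections.Counter({"someword": 3, "anotherword": 1})

# save once; later runs load this instead of re-reading the text files
path = os.path.join(tempfile.mkdtemp(), "counts.pkl")
with open(path, "wb") as f:
    pickle.dump(counts, f)

with open(path, "rb") as f:
    loaded = pickle.load(f)
```

Loading a pickled Counter is typically far cheaper than re-tokenizing the raw files on every run.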
Just as an out-of-the-box solution:
Have you tried turning the files into pandas DataFrames? I.e. I assume you already build a word list out of the input (by stripping punctuation such as . and ,) using input.split(' ') or something similar. You can then turn those lists into DataFrames, perform a word count, and then do a cartesian join:
import pandas as pd
df_1 = pd.DataFrame(src, columns=['word_1'])
df_1['count_1'] = 1
df_1 = df_1.groupby(['word_1']).sum()
df_1 = df_1.reset_index()
df_2 = pd.DataFrame(tgt, columns=['word_2'])
df_2['count_2'] = 1
df_2 = df_2.groupby(['word_2']).sum()
df_2 = df_2.reset_index()
df_1['link'] = 1
df_2['link'] = 1
result_df = pd.merge(left=df_1, right=df_2, left_on='link', right_on='link')
del result_df['link']
I use this sort of thing for basket analysis; it works really well.
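A runnable sketch of the same idea with toy token lists (invented data). On pandas >= 1.2 the dummy 'link' column can be replaced by merge(how='cross'):

```python
import pandas as pd

# toy token lists standing in for the two files
src = ["the", "cat", "the"]
tgt = ["le", "chat", "le"]

df_1 = pd.DataFrame(src, columns=["word_1"])
df_1["count_1"] = 1
df_1 = df_1.groupby("word_1").sum().reset_index()

df_2 = pd.DataFrame(tgt, columns=["word_2"])
df_2["count_2"] = 1
df_2 = df_2.groupby("word_2").sum().reset_index()

# cartesian join of the two word-count tables
result_df = df_1.merge(df_2, how="cross")
```

The result has one row per (word_1, word_2) combination with both words' counts, which is the starting point for pair statistics.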