Pandas - Iterating over an index in a loop - python

I have a weird interaction that I would need help with. Basically:
1) I have created a pandas DataFrame that contains 1179 rows x 6 columns. One column holds street names, and the same value will have several duplicates (because each line represents a point, and each point is associated with a street).
2) I also have a list of all the streets in this pandas DataFrame.
3) If I run this line, I get an output of all the rows matching that street name:
print(sub_df[sub_df.AQROUTES_3 == 'AvenueMermoz'])
Result :
FID AQROUTES_3 ... BEARING E_ID
983 983 AvenueMermoz ... 288.058014
984 984 AvenueMermoz ... 288.058014
992 992 AvenueMermoz ... 288.058014
1005 1005 AvenueMermoz ... 288.058014
1038 1038 AvenueMermoz ... 288.058014
1019 1019 AvenueMermoz ... 288.058014
However, if I run this command in a loop with the string of my list as the street name, it returns an empty DataFrame:
x = ()
for names in pd_streetlist:
    print(names)
    x = names
    print(sub_df[sub_df.AQROUTES_3 == "'" + str(x) + "'"])
    x = ()
Returns :
RangSaint_Joseph
Empty DataFrame
Columns: [FID, AQROUTES_3, X, Y, BEARING, E_ID]
Index: []
AvenueAugustin
Empty DataFrame
Columns: [FID, AQROUTES_3, X, Y, BEARING, E_ID]
Index: []
and so on...
I can't figure out why. Does anybody have an idea?
Thanks

I believe the issue is in this line:
print(sub_df[sub_df.AQROUTES_3 =="'"+str(x)+"'"])
To each name you unnecessarily add quote characters at the beginning and at the end, so that each valid street name (in your example 'AvenueMermoz') turns into "'AvenueMermoz'" (where we had to use double quotes to enclose the single-quoted string).
As @busybear has commented, there is no need to cast to str either. So the corrected line would be:
print(sub_df[sub_df.AQROUTES_3 == x])

So you're adding quotation marks to the filter, which you shouldn't. Right now you're filtering on 'AvenueMermoz' while you just want to filter on AvenueMermoz.
So
print(sub_df[sub_df.AQROUTES_3 == "'" + str(x) + "'"])
should become
print(sub_df[sub_df.AQROUTES_3 == str(x)])
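As a side note, the per-street loop (and the separate street list) can be skipped entirely with groupby. A minimal sketch, using a tiny hypothetical frame that mirrors the question's columns:

```python
import pandas as pd

# Hypothetical data mirroring the question's layout
sub_df = pd.DataFrame({
    "AQROUTES_3": ["AvenueMermoz", "AvenueMermoz", "RangSaint_Joseph"],
    "BEARING": [288.058014, 288.058014, 101.5],
})

# One sub-DataFrame per street, no manual list of street names needed
for street, group in sub_df.groupby("AQROUTES_3"):
    print(street)
    print(group)
```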

How to append from a dataframe to a list?

I have a dataframe called pop_emoj that has two columns (one for the emoji, and one for the emoji count) as seen below.
☕ 585
🌭 193
🌮 186
🌯 85
🌰 53
🌶 124
🌽 138
🍄 46
🍅 170
🍆 506
I have sorted the df based on the counts in descending order as seen below.
emoji_updated = pop_emoj.head(105).sort_values(ascending=False)
🍻 1809
🎂 1481
🍔 1382
🍾 1078
🥂 1028
And I'm trying to use the top n emojis to append to a new list called top_list, but I am getting stuck. Here is my code so far.
def top_number_of_emojis(n):
    top_list = []
    top_list = emoji_updated[0].tolist()
    return top_list
I want to take all of column 1 (the emojis) and append them to my list (top_list). The output should look like this:
top_number_of_emojis(1) == ['🍻']
top_number_of_emojis(2) == ['🍻', '🎂']
top_number_of_emojis(3) == ['🍻', '🎂', '🍔']
If you already have the emojis sorted by count, you just need to save the first n of them to a list.
One option for that is iterrows:
top_list = []
for idx, row in emoji_updated.head(n).iterrows():
    top_list.append(row[0])  # the first column holds the emoji
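A loop-free sketch of the whole function; the column names 'emoji' and 'count' are assumptions here, so adjust them to your actual frame:

```python
import pandas as pd

# Hypothetical data mirroring the question's two-column layout
pop_emoj = pd.DataFrame({"emoji": ["🍾", "🍻", "🍔", "🎂"],
                         "count": [1078, 1809, 1382, 1481]})

def top_number_of_emojis(n):
    # Sort by count descending, then take the first n emoji values as a list
    ordered = pop_emoj.sort_values("count", ascending=False)
    return ordered["emoji"].head(n).tolist()

print(top_number_of_emojis(3))  # ['🍻', '🎂', '🍔']
```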

Convert number string with commas and negative values to float [Pandas]

I would like to convert negative value strings and strings with commas to float in a DataFrame, but I am struggling to do both operations at the same time.
customer_id Revenue
332 1,293.00
293 -485
4284 1,373.80
284 -327
Output_df
332 1293.00
293 485
4284 1373.80
284 327
Convert to numeric and then take the absolute value:
df["Revenue"] = pd.to_numeric(df["Revenue"]).abs()
If the above doesn't work, then try:
df["Revenue"] = pd.to_numeric(df["Revenue"].str.strip().str.replace(",", "")).abs()
Here I first make a call to str.strip() to remove any whitespace in your float. Then, I remove commas using str.replace().
Does using .str.replace() help?
df["Revenue"] = pd.to_numeric(df["Revenue"].str.replace(",", "")).abs()
If you are getting the DataFrame from a csv file, you can use the following at import to address the commas, and then deal with the - later:
df = pd.read_csv('foo.csv', thousands=',')
df["Revenue"] = pd.to_numeric(df["Revenue"]).abs()
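Putting the comma removal and the sign removal together, a minimal sketch with the sample values from the question:

```python
import pandas as pd

df = pd.DataFrame({"customer_id": [332, 293, 4284, 284],
                   "Revenue": ["1,293.00", "-485", "1,373.80", "-327"]})

# Strip thousands separators, parse to numbers, then drop the sign
df["Revenue"] = pd.to_numeric(df["Revenue"].str.replace(",", "", regex=False)).abs()
print(df["Revenue"].tolist())  # [1293.0, 485.0, 1373.8, 327.0]
```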

Splitting a Column using Pandas

I am trying to split the following column using Pandas: (df name is count)
Location count
POINT (-118.05425 34.1341) 355
POINT (-118.244512 34.072581) 337
POINT (-118.265586 34.043271) 284
POINT (-118.360102 34.071338) 269
POINT (-118.40816 33.943626) 241
to this desired outcome:
X-Axis Y-Axis count
-118.05425 34.1341 355
-118.244512 34.072581 337
-118.265586 34.043271 284
-118.360102 34.071338 269
-118.40816 33.943626 241
I have tried removing the word 'POINT' and both parentheses, but then I am left with an extra white space at the beginning of the column. I tried using:
count.columns = count.columns.str.lstrip()
But it was not removing the white space.
I was hoping to use this code to split the column:
count = pd.DataFrame(count.Location.str.split(' ', 1).tolist(),
                     columns=['x-axis', 'y-axis'])
since the space between the x and y values could be used as the separator, but the leading white space gets in the way.
You can use .str.extract with regex pattern having capture groups:
df[['x-axis', 'y-axis']] = df.pop('Location').str.extract(r'\((\S+) (\S+)\)')
print(df)
count x-axis y-axis
0 355 -118.05425 34.1341
1 337 -118.244512 34.072581
2 284 -118.265586 34.043271
3 269 -118.360102 34.071338
4 241 -118.40816 33.943626
a quick solution can be:
(df['Location']
 .str.split(' ', 1)             # like what you did
 .str[-1]                       # select only "(x y)"
 .str.strip('(')                # remove the opening parenthesis
 .str.strip(')')                # remove the closing parenthesis
 .str.split(' ', expand=True))  # expand to two columns
then you may rename column names using .rename or df.columns = colnames
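Either way, the extracted columns are still strings; a short sketch that also casts them to floats, using the question's sample data:

```python
import pandas as pd

count = pd.DataFrame({"Location": ["POINT (-118.05425 34.1341)",
                                   "POINT (-118.244512 34.072581)"],
                      "count": [355, 337]})

# Capture the two numbers inside the parentheses, then cast to float
coords = count.pop("Location").str.extract(r"\((\S+) (\S+)\)").astype(float)
count["x-axis"] = coords[0]
count["y-axis"] = coords[1]
print(count)
```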

How to compare pairs of values in two dataframes of different sizes in python?

I have two dataframes of different sizes:
sdfn with columns 'ConceptID1' and ConceptID2'
ConceptID1 ConceptID2
0 5743 4513
1 5743 7099
2 4513 7099
3 10242 7042
4 10242 7099
... ... ...
2601 12028 12043
2602 12371 12043
2603 266632 54106
2604 266632 51135
2605 54106 51135
jdfn with columns 'Gene1' and 'Gene2'
Gene1 Gene2
0 1535 353
1 9970 332
2 23581 112401
3 846 112401
4 150160 112401
.. ... ...
384 79626 51284
385 79626 51311
386 7305 51311
387 80342 79626
388 7305 79626
Comparing through both data frames, I need to find matching pairs.
I tried this
for index, row in sdfn.iterrows():
    for index, row in jdfn.iterrows():
        if ((sdfn['ConceptID1']==jdfn['Gene1']) and (sdfn['ConceptID2']==jdfn['Gene2'])) or (sdfn['ConceptID1']==jdfn['Gene2']) and ((sdfn['ConceptID2']==jdfn['Gene1'])):
            print(sdfn['ConceptID1'], jdfn['Gene1'], sdfn['ConceptID2'], jdfn['Gene2'])
The result:
Traceback (most recent call last):
  File "", line 3, in <module>
    if ((sdfn['ConceptID1']==jdfn['Gene1']) and (sdfn['ConceptID2']==jdfn['Gene2'])) or (sdfn['ConceptID1']==jdfn['Gene2']) and ((sdfn['ConceptID2']==jdfn['Gene1'])):
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/core/ops/__init__.py", line 1142, in wrapper
    raise ValueError("Can only compare identically-labeled " "Series objects")
ValueError: Can only compare identically-labeled Series objects
The issue here is that you are not using or naming your for loop variables correctly and attempting to compare the entirety of each dataframe column directly.
sdfn['ConceptID1'], sdfn['ConceptID2'], jdfn['Gene1'], jdfn['Gene2']
will refer to the entire dataframe column, which pandas defines as a Series type object, hence the mention of Series label mismatch in the error message.
You will need to first rename your for loop variables, and then use them in the search:
for sind, srow in sdfn.iterrows():
    for jind, jrow in jdfn.iterrows():
        if ((srow['ConceptID1']==jrow['Gene1']) and (srow['ConceptID2']==jrow['Gene2'])) or (srow['ConceptID1']==jrow['Gene2']) and ((srow['ConceptID2']==jrow['Gene1'])):
            print(srow['ConceptID1'], jrow['Gene1'], srow['ConceptID2'], jrow['Gene2'])
Note that in your posted code, the index and row variables are declared in the outer loop yet reassigned by the inner loop. So instead of two pairs of loop variables, there is only one pair being overwritten, which makes it impossible to compare the appropriate data.
Hope this helps!
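As a side note, the nested iterrows loop is O(n*m); a vectorized sketch using merge (assuming the column names from the question, with tiny made-up frames) finds both orderings of a pair:

```python
import pandas as pd

# Hypothetical small frames mirroring the question's columns
sdfn = pd.DataFrame({"ConceptID1": [5743, 4513, 10242],
                     "ConceptID2": [4513, 7099, 7042]})
jdfn = pd.DataFrame({"Gene1": [4513, 7042], "Gene2": [5743, 10242]})

# Match (ConceptID1, ConceptID2) against (Gene1, Gene2) and the swapped order
direct = sdfn.merge(jdfn, left_on=["ConceptID1", "ConceptID2"],
                    right_on=["Gene1", "Gene2"])
swapped = sdfn.merge(jdfn, left_on=["ConceptID1", "ConceptID2"],
                     right_on=["Gene2", "Gene1"])
matches = pd.concat([direct, swapped], ignore_index=True)
print(matches)
```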

Python: "in" does not recognize values in DataFrame column

I have an excerpt from a DataFrame "IRAData" and a Column called 'Labels':
380 u'itator-Research'
381 u'itator-OnSystem'
382 u'itator-QueryClient'
383 u'itator-OnSystem'
384 u'itator-OnSystem'
385 u'itator-OnSystem'
386 u'itator-OnSystem'
387 u'itator-OnSystem'
388 u'itator-OnSystem'
Name: Labels, dtype: object
But when I run the following code, I get "False" output:
print(u'itator-QueryClient' in IRAData['Labels'])
Same goes for the other values in the column and when I remove the unicode 'u'.
Anyone have an idea as to why?
EDIT: The solution that I placed in a comment below worked. Did not need to attempt the answer to the suggested duplicate question.
I think the best way to avoid these problems is to import the data correctly.
You store "u'itator-QueryClient'", where u is a literal marker of a unicode string, when 'itator-QueryClient' is the right value to store here.
For example, from this HTML page, just select and copy lines 381 to 384 and invoke:
In [498]: import ast
In [499]: pd.read_clipboard(names=['value'], index_col=0, header=None,
                            converters={'value': ast.literal_eval})
Out[499]:
value
381 itator-OnSystem
382 itator-QueryClient
383 itator-OnSystem
384 itator-OnSystem
Then 'itator-QueryClient' in IRAData['value'] will be evaluated to True.
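It is also worth knowing that, independently of how the values are stored, `in` on a pandas Series checks the index, not the values; a minimal sketch:

```python
import pandas as pd

s = pd.Series(["itator-QueryClient", "itator-OnSystem"], index=[382, 383])

print("itator-QueryClient" in s)         # False: `in` looks at the index labels
print("itator-QueryClient" in s.values)  # True: check the values instead
print(382 in s)                          # True: 382 is an index label
```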
