Python pandas with parameter length changing

I want to do
df[(df['col']==50) | (df['col']==150) | etc ...]
where "etc" stands for a varying number of conditions, from 1 to many,
so I build the expression in a loop. The result is a string like
s = "(df['col']==50) | (df['col']==150) | (df['col']==100)"
Then I do
df[s]
but this does not work.
How can I make it work?

A simple solution:
list_of_numbers = [50,150]
df[df["col"].isin(list_of_numbers)]
where list_of_numbers contains the numbers you want to include in the condition. I'm assuming here that your conditions are always combined with or.
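As a quick sanity check, here is a minimal runnable example (the DataFrame below is a throwaway frame built just for illustration):

import pandas as pd

df = pd.DataFrame({'col': [25, 50, 100, 150, 200]})
list_of_numbers = [50, 150]

# isin builds a single boolean mask, equivalent to chaining | comparisons
print(df[df['col'].isin(list_of_numbers)])
#    col
# 1   50
# 3  150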

Use query to filter a dataframe from a string
df = pd.DataFrame({'col': range(25, 225, 25)})
l = [50, 100, 150]
q = ' | '.join([f"col == {i}" for i in l])
out = df.query(q)
>>> q
'col == 50 | col == 100 | col == 150'
>>> out
   col
1   50
3  100
5  150
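If you are on a reasonably recent pandas, query can also take the membership test directly, so building the string by hand becomes optional. A minimal sketch using the same df and l:

# query supports `in` plus @-prefixed references to local variables
out = df.query('col in @l')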

Related

Pandas DataFrame string replace followed by split and set intersection

I have the following pandas DataFrame:
data = ['18#38#123#23=>21', '18#38#23#55=>35']
d = pd.DataFrame(data, columns = ['rule'])
and I have a list of integers
r = [18, 55]
and I want to filter rules from the above DataFrame where all integers in the list r are also present in the rule. I tried the following code and failed:
d[d['rule'].str.replace('=>','#').split('#').astype(set).issuperset(set(r))]
How can I achieve the desired filtering with pandas?
You were going in the right direction; you just need to use the apply function instead:
d[d['rule'].str.replace('=>','#').str.split('#').apply(lambda x: set(x).issuperset(set(map(str,r))))]
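For reference, a minimal end-to-end run of that expression, using the d and r defined in the question:

import pandas as pd

data = ['18#38#123#23=>21', '18#38#23#55=>35']
d = pd.DataFrame(data, columns=['rule'])
r = [18, 55]

# split each rule into tokens, then keep rows whose token set contains all of r (as strings)
mask = d['rule'].str.replace('=>', '#').str.split('#').apply(
    lambda x: set(x).issuperset(set(map(str, r))))
print(d[mask])
#               rule
# 1  18#38#23#55=>35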
Using str.get_dummies
d.rule.str.replace('=>','#').str.get_dummies(sep='#').loc[:, list(map(str, r))].all(1)
Outputs
0 False
1 True
dtype: bool
Detail:
get_dummies + loc returns
   18  55
0   1   0
1   1   1
My initial instinct would be to use a list comprehension:
df = pd.DataFrame(['18#38#123#23=>21', '188#38#123#23=>21', '#18#38#23#55=>35'], columns = ['rule'])
def wrap(n):
    return r'(?<=[^|^\d]){}(?=[^\d])'.format(n)
patterns = [18, 55]
pd.concat([df['rule'].str.contains(wrap(pattern)) for pattern in patterns], axis=1).all(axis=1)
Output:
0 False
1 False
2 True
My approach is similar to @RafaelC's answer, but converts the column names to int:
new_df = d.rule.str.replace('=>','#').str.get_dummies(sep='#')
new_df.columns = new_df.columns.astype(int)
has_all = new_df[r].all(1)
# then you can assign a new column on the initial data frame
d['new_col'] = 10
d.loc[has_all, 'new_col'] = 100
Output:
+-------+-------------------+------------+
| | rule | new_col |
+-------+-------------------+------------+
| 0 | 18#38#123#23=>21 | 10 |
| 1 | 188#38#23#55=>35 | 10 |
| 2 | 18#38#23#55=>35 | 100 |
+-------+-------------------+------------+
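As a side note, the two-step assignment at the end can likely be collapsed into one vectorized call with np.where (a sketch, assuming numpy is imported as np):

import numpy as np

# pick 100 where every value in r occurs in the rule, else 10
d['new_col'] = np.where(has_all, 100, 10)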

Using values of a list stored in DataFrame cell in Pandas

I have a CSV file in which each cell value is a two-element list (pair).
| 0 | 1 | 2 |
----------------------------------------
0 |[87, 1.03] | [30, 4.05] | NaN |
1 |[34, 2.01] | NaN | NaN |
2 |[83, 0.2] | [18, 3.4] | NaN |
How do I access the elements of these, separately? The first element of each pair acts as an index for another CSV table.
I have done something like this, but it keeps bugging me on one thing or another.
links = pd.read_csv('buslinks.csv', header = None)
a_list = []
for i in range(0, 100):
    l = []
    a_list.append(l)
for j in range(0, 100):
    a = busStops.iloc[j]
    df = pd.DataFrame(columns = ['id', 'Distance'])
    l = links.iloc[j]
    for i in l:
        if(pd.isnull(i)):
            continue
        else:
            x = int(i[0])
            d = busStops.iloc[x-1]
            id = d['id']
            dist = distance(d['xCoordinate'], a['xCoordinate'], d['yCoordinate'], a['yCoordinate'])
            df.loc[i] = [id, dist]
    a_list[j] = (df.sort('Distance', ascending = True)).tolist()
This approach worked when each cell contained only one element. In that case, np.isnan() was used instead of pd.isnull()
The CSV file being read was created as follows:
a_list = []
for i in range(0, 100):
    l = []
    a_list.append(l)
for i in range(0, 100):
    while(len(a_list[i]) < 3):
        x = random.randint(1, 100)
        if(x-1 == i):
            continue
        a = busStops.iloc[i]
        b = busStops.iloc[x-1]
        dist = distance(a['xCoordinate'], b['xCoordinate'], a['yCoordinate'], b['yCoordinate'])
        if dist > 3:
            continue
        if x in a_list[i]:
            continue
        a_list[i].append([b['id'], dist])
        a_list[x-1].append([a['id'], dist])
    for j in range(0, 3):
        y = random.randint(0, 1)
        while (y == 0):
            x = random.randint(1, 100)
            if(x-1 == i):
                continue
            a = busStops.iloc[i]
            b = busStops.iloc[x-1]
            dist = distance(a['xCoordinate'], b['xCoordinate'], a['yCoordinate'], b['yCoordinate'])
            if dist > 3:
                continue
            if x in a_list[i]:
                continue
            a_list[i].append([b['id'], dist])
            a_list[x-1].append([a['id'], dist])
            y = 1
dfLinks = pd.DataFrame(a_list)
dfLinks
dfLinks.to_csv('buslinks.csv', index = False, header = False)
busStops is yet another CSV file, which contains id, xCoordinate, yCoordinate, Population and Priority as columns.
First of all, beware that storing lists in DataFrames dooms you to Python-speed loops. To take advantage of fast Pandas/NumPy routines, you need to use native NumPy dtypes such as np.float64 (whereas, in contrast, lists require the "object" dtype).
That being said, here is some code I wrote just to show how to do it, so you can adapt it in your own code:
import pandas as pd

table = pd.DataFrame(columns=['col1', 'col2', 'col3'])
table.loc[0] = [1, 2, 3]
table.loc[1] = [1, [2, 3], 4]
table.loc[1].iloc[1]     # returns [2, 3]
table.loc[1].iloc[1][0]  # returns 2
You shouldn't be putting lists in pd.Series objects. It's inefficient and you lose all vectorised functionality. If, however, you are determined that this must be your starting point, you can unravel the lists into multiple columns in a couple of steps.
Setup
df = pd.DataFrame({0: [[87, 1.03], [34, 2.01], [83, 0.2]],
                   1: [[30, 4.05], np.nan, [18, 3.4]],
                   2: [np.nan, np.nan, np.nan]})
Step 1: ensure lists have same size
# messy way to ensure all values have length 2
df[1] = np.where(df[1].isnull(), pd.Series([[np.nan, np.nan]]*len(df[1])), df[1])
print(df)
            0           1    2
0  [87, 1.03]  [30, 4.05]  NaN
1  [34, 2.01]  [nan, nan]  NaN
2   [83, 0.2]   [18, 3.4]  NaN
Step 2: concatenate dataframes of split series
# create list of dataframes
L = [pd.DataFrame(df[col].values.tolist()) for col in df]
# concatenate dataframes in list
df_new = pd.concat(L, axis=1, ignore_index=True)
print(df_new)
    0     1     2     3   4
0  87  1.03  30.0  4.05 NaN
1  34  2.01   NaN   NaN NaN
2  83  0.20  18.0  3.40 NaN
You can then access values as you would normally, e.g. df_new[2].
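Since the question says the first element of each pair indexes another table, one hedged follow-up (assuming the even-numbered columns of df_new hold those ids, and pandas >= 1.0) is to cast them to a nullable integer type before using them as lookups:

# columns 0 and 2 hold the first element of each original pair
id_cols = df_new[[0, 2]].astype('Int64')  # 'Int64' keeps integers while allowing the NaN gaps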

How to get average of increasing values using Pandas?

I'm trying to figure out the average of increasing values in my table per column.
my table
A | B | C
----------------
0 | 5 | 10
100 | 2 | 20
50 | 2 | 30
100 | 0 | 40
The function I'm trying to write for my problem:
def avergeIncreace(data, value):  # not complete, but what I have so far
    x = data[value].pct_change().fillna(0).gt(0)
    print(x)
pct_change() returns a table of the percentage change of the number at each index compared to the number in the row before it. fillna(0) replaces the NaN in position 0 that pct_change() creates with 0. gt(0) returns a true/false table depending on whether the value at each index is greater than 0.
Current output of this function:
In[1]:avergeIncreace(df,'A')
Out[1]: 0 False
1 True
2 False
3 True
Name: BAL, dtype: bool
Desired output:
In[1]:avergeIncreace(df,'A')
Out[1]:75
In[2]:avergeIncreace(df,'B')
Out[2]:0
In[3]:avergeIncreace(df,'C')
Out[3]:10
From my limited understanding of pandas, there should be a way to return an array of all the indexes that are True and then loop through the original data table with a for loop, but I believe pandas should have a way to do this without one.
Here is what I think the for-loop way would look like (it is missing the code to keep only the indexes that are True instead of every index):
avergeIncreace(df, 'A')
indexes = data[value].pct_change().fillna(0).gt(0).index.values  # returns an array containing all of the indexes (True and False)
answer = 0
times = 0
for x in indexes:
    answer += (data[value][x] - data[value][x-1])
    times += 1
print(answer / times)
How do I achieve my desired output without using a for loop in the function?
You can use mask() and diff():
df.diff().mask(df.diff()<=0, np.nan).mean().fillna(0)
Yields:
A 75.0
B 0.0
C 10.0
dtype: float64
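An equivalent spelling, if mask() feels indirect, is plain boolean indexing on the diffs (a sketch that computes the diff once; mean() skips the NaNs left behind):

diffs = df.diff()
# keep only the positive deltas; everything else becomes NaN and is ignored by mean()
diffs[diffs > 0].mean().fillna(0)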
How about
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [0, 100, 50, 100],
                   'B': [5, 2, 2, 0],
                   'C': [10, 20, 30, 40]})

def averageIncrease(df, col_name):
    # Create array of deltas. Replace nan and negative values with zero
    a = np.maximum(df[col_name] - df[col_name].shift(), 0).replace(np.nan, 0)
    # Count non-zero values
    count = np.count_nonzero(a)
    if count == 0:
        # If there are only zero values, there is no increase
        return 0
    else:
        return np.sum(a) / count

print(averageIncrease(df, 'A'))
print(averageIncrease(df, 'B'))
print(averageIncrease(df, 'C'))
75.0
0
10.0

Python - Place a character in between a formatted string

Let's say I want to print out
Item 1 | Item 2
A Third item | #4
This can be done without the | by doing
print('%-20s %20s' % (value1, value2))
How would I go about placing the | character so that it is evenly justified between the two formatted values?
I suppose I could manually count the length of the formatted string without the | character and then insert the | in the middle, but I am hoping for a more elegant solution!
Thank you very much for your time.
Edit: Here is a solution along the lines I suggested:
def PrintDoubleColumn(value1, value2):
    initial = '%-20s %20s' % (value1, value2)
    lenOfInitial = len(initial)
    print(initial[:lenOfInitial // 2] + '|' + initial[lenOfInitial // 2 + 1:])
There is a good source for string format operations: https://pyformat.info/#string_pad_align
x = [1, 2, 3, 4, 5]
y = ['a', 'b', 'c', 'd', 'e']
for i in range(0, 5):
    print("{0:<20}|{1:>20}".format(x[i], y[i]))
Result:
1                   |                   a
2                   |                   b
3                   |                   c
4                   |                   d
5                   |                   e
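On Python 3.6+, the same alignment can be written with an f-string; a minimal sketch with two placeholder values:

a, b = 'Item 1', 'Item 2'
# left-justify a in 20 columns, right-justify b in 20 columns, | in between
print(f"{a:<20}|{b:>20}")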

Add +1 to each item in a comma-separated string in pandas dataframe

I have a pandas dataframe structured as follows:
| ID | Start | Stop |
________________________________________
| 1 | 1,2,3,4 | 5,6,7,7 |
| 2 | 100,101 | 200,201 |
For each row in the dataframe, I'd like to add 1 to each value in the Start column. The dtype for the Start column is 'object'.
Desired output looks like this:
| ID | Start | Stop |
________________________________________
| 1 | 2,3,4,5 | 5,6,7,7 |
| 2 | 101,102 | 200,201 |
I've tried the following (and many versions of the following), but get an error stating TypeError: cannot concatenate 'str' and 'int' objects:
df['test'] = [str(x + 1) for x in df['Start']]
I tried casting the column as an int, but got invalid literal for long() with base 10: '101,102':
df['test'] = [int(x) + 1 for x in df['start'].astype(int)]
I also tried converting the field to a list using str.split(), then casting each item as an integer.
Thanks in advance!
df['Start'] is the whole series, so you have to iterate over it and then split each value:
new_series = []
for x in df['Start']:
    value_list = []
    for y in x.rstrip(',').split(','):
        value_list.append(str(int(y) + 1))
    new_series.append(','.join(value_list))
df['test'] = new_series
The error telling you that you cannot concatenate string and int objects means that x must be a string. You can solve this by casting x to an int before adding 1 to it, so str(x + 1) becomes str(int(x) + 1):
df['test'] = [str(int(x) + 1) for x in df['Start']]
df = pd.DataFrame({'Start': [[1, 2, 3, 4], [100, 101]],
                   'Stop': [[5, 6, 7, 7], [200, 201]]})
df.Start = df.Start.apply(lambda x: [y + 1 for y in x])
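If the column really does hold comma-separated strings (as in the original frame) rather than lists, a hedged one-liner combines str.split with apply and a join:

# split on commas, add 1 to each piece, and re-join into a string
df['Start'] = df['Start'].str.split(',').apply(
    lambda xs: ','.join(str(int(x) + 1) for x in xs))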
