I'm reading a df from a CSV that has 2 columns showing the prices of various items. In some cases the price is a single int/float, but in other cases it could be a range of space-separated ints/floats or a mixture of ints/floats with strings.
example df:
item   prices
------ ---------------------------
a      2
b      3.5
c      5
d      0.04
e      1 8 3 4 2
f      0.04 0.04 0.01
g      Normal: 4.56Premium: 4.75
What I'm looking for is a nice pythonic way to get the prices column to display the lowest possible int/float value for every item, e.g.
item   prices
------ --------
a      2
b      3.5
c      5
d      0.04
e      1
f      0.01
g      4.56
The only way I could think of solving this problem for items e and f would be to split the value using str.split(" ") and map the output to int or float, but this seems like it would be messy since not all values are the same type. And I don't even know how I would get the lowest value for item g.
Any help would be appreciated.
Use Series.str.extractall to get the integers or floats, convert them to float, and take the minimum per original row (extractall returns a MultiIndex whose first level is the original row index, so groupby(level=0) groups the matches back by row):
df['prices'] = (df['prices'].str.extractall(r'(\d+\.\d+|\d+)')[0]
                            .astype(float)
                            .groupby(level=0)
                            .min())
print(df)
item prices
0 a 2.00
1 b 3.50
2 c 5.00
3 d 0.04
4 e 1.00
5 f 0.01
6 g 4.56
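For reference, here is a self-contained sketch of the same approach on the example data from the question (nothing beyond what the answer above already uses):
import pandas as pd

df = pd.DataFrame({
    'item': list('abcdefg'),
    'prices': ['2', '3.5', '5', '0.04', '1 8 3 4 2',
               '0.04 0.04 0.01', 'Normal: 4.56Premium: 4.75'],
})

# extractall returns one row per regex match, indexed by (original row, match number),
# so groupby(level=0).min() collapses the matches back to one minimum per original row
matches = df['prices'].str.extractall(r'(\d+\.\d+|\d+)')[0].astype(float)
df['prices'] = matches.groupby(level=0).min()
print(df)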
I have some data that looks like this:
agent external id,product_group,commission_rate,weeks_held
3000-29,Collection,0.85,1
3000-29,Collection,0.85,2
3000-29,Return,0.85,1
3000-12,Collection,0.85,1
3000-12,Collection,0.85,2
3000-12,Return,0.85,1
3000-34,Collection,0.8,2
3000-34,Collection,0.8,2
3000-34,Return,0.8,1
3022-29,Collection,0.75,1
3022-29,Collection,0.75,2
3022-29,Return,0.75,1
I'm trying to create a dataframe that, for each agent external id, gives me the following as separate columns:
The agent external id
The count() of each product group
The sum() of weeks held by each product group
The output I'm looking for is:
agent_external_id,count_collection,sum_collection_weeks_held,count_return,sum_return_weeks_held
3000-29,2,3,0,0
3000-29,0,0,1,1
3000-12,2,3,0,0
3000-12,0,0,1,1
...
I've tried all types of groupby() but I'm not able to structure the data in the way that I want.
Is this what you would like? Please give a sample output and someone will help.
df.groupby(['agent_external_id', 'product_group']).agg(['sum', 'count'])
                                 commission_rate       weeks_held
                                             sum count        sum count
agent_external_id product_group
3000-12           Collection                1.70     2          3     2
                  Return                    0.85     1          1     1
3000-29           Collection                1.70     2          3     2
                  Return                    0.85     1          1     1
3000-34           Collection                1.60     2          4     2
                  Return                    0.80     1          1     1
3022-29           Collection                1.50     2          3     2
                  Return                    0.75     1          1     1
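To reach the wide, one-row-per-agent layout sketched in the question, the grouped result can be unstacked and the columns flattened. A rough sketch, assuming pandas 0.25+ named aggregation and a hypothetical agents.csv holding the sample rows above:
import pandas as pd

df = pd.read_csv('agents.csv')   # hypothetical file containing the sample rows above

agg = (df.groupby(['agent external id', 'product_group'])
         .agg(count=('weeks_held', 'count'), sum_weeks_held=('weeks_held', 'sum'))
         .unstack(fill_value=0))

# flatten the (stat, product_group) column MultiIndex into names like count_Collection
agg.columns = [f'{stat}_{group}' for stat, group in agg.columns]
print(agg.reset_index())
This produces one row per agent rather than the two rows per agent shown in the desired output, which is usually what a per-agent summary needs.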
I have a column which may contain values like abc,def or abc,def,efg, or ab,12,34, etc. As you can see, some values end with a , and some don't. What I want to do is remove all such values that end with a comma ,.
Assuming the data is loaded and a data frame is created, this is what I do:
df[c] = df[c].astype('unicode').str.replace("/,*$/", '').str.strip()
But it doesn't do anything.
What am I doing wrong?
The way you were trying to do it would be something like this:
df[c] = df[c].str.rstrip(',')
rstrip(',') will remove commas just from the end of the string.
strip(',') would remove them from both the start and the end.
The above replaces the text; it will not drop rows from the dataframe. To drop the rows instead, do the following:
Use str.endswith:
df[~df['col'].str.endswith(',')]
Consider below df:
In [1547]: df
Out[1547]:
date id value rolling_mean col
0 2016-08-28 A 1 nan a,
1 2016-08-28 B 1 nan b
2 2016-08-29 C 2 nan c,
3 2016-09-02 B 0 0.50 d
4 2016-09-03 A 3 2.00 ee,ff
5 2016-09-06 C 1 1.50 gg,
6 2017-01-15 B 2 1.00 i,
7 2017-01-18 C 3 2.00 j
8 2017-01-18 A 2 2.50 k,
In [1548]: df = df[~df['col'].str.endswith(',')]
In [1549]: df
Out[1549]:
date id value rolling_mean col
1 2016-08-28 B 1 nan b
3 2016-09-02 B 0 0.50 d
4 2016-09-03 A 3 2.00 ee,ff
7 2017-01-18 C 3 2.00 j
Your regex is wrong because it contains regex delimiter characters (the surrounding /.../). Python regexes are plain strings, not regex literals.
Use
df[c] = df[c].astype('unicode').str.replace(",+$", '', regex=True).str.strip()
The ,+$ pattern matches one or more commas at the end of the string (regex=True is needed in recent pandas, where str.replace treats the pattern as a literal string by default).
Also, see Regular expression works on regex101.com, but not on prod
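For reference, a minimal sketch of the three options on a toy column (the column name c is just the placeholder used in the question):
import pandas as pd

df = pd.DataFrame({'c': ['abc,def', 'abc,def,efg,', 'ab,12,34,']})

# option 1: strip a trailing comma but keep the row
stripped = df['c'].str.rstrip(',')

# option 2: same thing via regex (regex=True is required in recent pandas)
stripped_re = df['c'].str.replace(',+$', '', regex=True)

# option 3: drop the rows whose value ends with a comma
kept = df[~df['c'].str.endswith(',')]

print(stripped.tolist())   # ['abc,def', 'abc,def,efg', 'ab,12,34']
print(kept)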
After performing the groupby on two columns (id and category) using the mean aggregation function over a column (col3) I have something like this:
              col3
id  category  mean
345 A           12
    B            2
    C            3
    D            4
    Total       21
What I would like to do is to add a new column called percentage in which I calculate the percentage of each category over the category Total.
This should be done separately for every id.
The result should be something like this:
              col3
id  category  mean  percentage
345 A           12        0.57
    B            2        0.09
    C            3        0.14
    D            4        0.19
    Total       21           1
Obviously I want to do that for every id, i.e. the first column on which I did the groupby. Any suggestion on how to do that?
Use get_level_values to filter out the Total rows and sum the remaining means per id, then divide with div:
s = df[df.index.get_level_values(level=1) != 'Total'].groupby(level=0).sum()
df['percentage'] = df.div(s, level=0, axis=1)
df
Out[422]:
               mean  percentage
id  category
345 A            12    0.571429
    B             2    0.095238
    C             3    0.142857
    D             4    0.190476
    Total        21    1.000000
That's my suggestion:
df['mean'] = df['mean'] / df['mean'].sum()
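Note that this divides by the grand total over all ids (and includes the Total rows in the denominator). To normalize within each id instead, one option is to divide by each id's Total row; a small sketch on data shaped like the question's (the index level names and the single id 345 are just the example):
import pandas as pd

idx = pd.MultiIndex.from_product([[345], ['A', 'B', 'C', 'D', 'Total']],
                                 names=['id', 'category'])
df = pd.DataFrame({'mean': [12, 2, 3, 4, 21]}, index=idx)

totals = df['mean'].xs('Total', level=1)              # one Total value per id
df['percentage'] = df['mean'].div(totals, level=0)    # broadcast the division over level 0 (id)
print(df)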
I have pairs of categorical data, but I don't want to double count instances where, for example, "toy" and "B" appear together multiple times.
I can do a pivot table with counts, but what I want is the equivalent of that with a 1 or 0 depending on whether ANY row matched that combination of two values, not the number of matches (2, 3, 4, etc.).
Here is an example input:
RS232,1.8,focused,C
RS233,2.8,chew,E
RS234,3.8,toy,D
RS235,4.8,poodle,C
RS236,5.8,winding,E
RS237,6.8,up,D
RS238,7.8,focused,B
RS239,9.8,chew,B
RS240,7.8,toy,B
RS241,6.8,toy,B
RS242,5.8,toy,A
RS243,4.8,focused,A
RS244,9.8,chew,A
RS245,8.8,chew,A
RS246,7.8,chew,C
RS247,6.8,winding,C
RS248,5.8,winding,C
RS249,4.8,winding,D
RS250,3.8,toy,D
The number field doesn't matter other than for an earlier filtering step. But I only want to count RS244 and RS245 as a single count in the bar plot, since making that combo twice just means people tried it a lot, not that multiple occurrences have any special meaning.
I eventually got to this data which I plotted:
attrib2 group count
0 chew A 2
1 chew B 1
2 chew C 1
3 chew E 1
4 focused A 1
5 focused B 1
6 focused C 1
7 poodle C 1
8 toy A 1
9 toy B 2
10 toy D 2
11 up D 1
12 winding C 2
13 winding D 1
14 winding E 1
Note that duplicate pairs have a count > 1, but for plotting I use .value_counts, so I ignore the count field and just plot how many UNIQUE items each element of attrib2 was paired with. The histogram I want is simply the count of the number of times each element is listed in the attrib2 column above.
The crude way I did it is this - surely there must be a cleaner, more pythonic way to accomplish this?
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import interactive
df= pd.read_csv('out.txt',sep=',',engine='c',lineterminator='\n',header='infer')
# # I am getting group/attrib2 pairs, but I want my plot to be against attrib2
groupout3 = df.groupby(['attrib2']).group.value_counts().sort_index()
# # groupby gives multiple counts for same combination, so set to 1 or leave as 0
# # following line not needed since I use value_counts below so it counts 1 if there is something there, regardless of the value, so 1, 2, etc. all get counted as 1 and 0 is 0
# #groupout3[groupout3 != 0 ] = 1
# #convert back to DataFrame for plotting
dfgroup = groupout3.to_frame('count')
# #make index back to column name
dfgroup.reset_index(level=['group','attrib2'], inplace=True)
# #plot categorical data counting
plt.figure(); dfgroup.attrib2.value_counts().plot(kind='bar')
plt.show()
surely there is a more elegant way to do this?
Thanks!
IIUC you can do it this way:
(df.groupby(['attrib2', 'group'])
   .size()                  # one entry per (attrib2, group) pair, regardless of how often it occurs
   .reset_index()
   .groupby('attrib2')
   .size()                  # number of distinct groups each attrib2 value was paired with
   .plot.bar(rot=0)
)
data:
In [85]: df
Out[85]:
attrib num attrib2 group
0 RS232 1.8 focused C
1 RS233 2.8 chew E
2 RS234 3.8 toy D
3 RS235 4.8 poodle C
4 RS236 5.8 winding E
5 RS237 6.8 up D
6 RS238 7.8 focused B
7 RS239 9.8 chew B
8 RS240 7.8 toy B
9 RS241 6.8 toy B
10 RS242 5.8 toy A
11 RS243 4.8 focused A
12 RS244 9.8 chew A
13 RS245 8.8 chew A
14 RS246 7.8 chew C
15 RS247 6.8 winding C
16 RS248 5.8 winding C
17 RS249 4.8 winding D
18 RS250 3.8 toy D
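An equivalent, arguably more direct route is to drop duplicate pairs first and then count; a sketch assuming df is loaded with the same columns as above:
import matplotlib.pyplot as plt

# keep one row per (attrib2, group) combination, then count how many
# distinct groups each attrib2 value was paired with
(df.drop_duplicates(['attrib2', 'group'])['attrib2']
   .value_counts()
   .sort_index()
   .plot.bar(rot=0))
plt.show()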
I'm trying to merge rows in a dataframe where I have different inputs for one ID, so I'd like to have a single row for each ID with a weight.
My dataframe looks like this:
ID A B C D weight
1 0.5 2 a 1 1.0
2 0.3 3 b 2 0.35
2 0.6 5 c 3 0.55
3 0.4 2 d 4 0.9
and I would need it to merge the A and B columns for ID=2 into a weighted average (0.3*0.35 + 0.6*0.55 for A, 3*0.35 + 5*0.55 for B). For column C I'd need to choose the value associated with the highest weight (C=c for ID=2), for column D the maximum value (D=3 in this case), and the final weight would be the sum of all weights (0.35+0.55). Basically, I need to apply several different rules to the duplicate rows for each ID, and I haven't found how to do this.
I'm using Python, and I believe pandas is the best tool for this, but I'm just a beginner here, so I'll listen to and try anything you suggest!
Thanks a lot!
import pandas as pd

a = pd.read_clipboard()

def agg_func(x):
    # weight A and B before summing so each becomes the weighted combination asked for
    x.A = x.A * x.weight
    x.B = x.B * x.weight
    # C from the row with the highest weight, max of D, and the summed weight
    return pd.Series([x.A.sum(), x.B.sum(), x.C[x.weight.idxmax()],
                      x.D.max(), x.weight.sum()], index=x.columns[1:])

print(a.groupby('ID').apply(agg_func))
A B C D weight
ID
1 0.500 2.0 a 1 1.00
2 0.435 3.8 c 3 0.90
3 0.360 1.8 d 4 0.90
This should do the job. Check http://pandas.pydata.org/pandas-docs/stable/groupby.html to learn more about groupby.
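For the record, the same rules can also be expressed without mutating the group in place; a sketch under the same assumptions as above (a is the dataframe read from the clipboard):
import pandas as pd

def agg_rules(g):
    return pd.Series({
        'A': (g['A'] * g['weight']).sum(),        # weight-scaled sum of A
        'B': (g['B'] * g['weight']).sum(),        # weight-scaled sum of B
        'C': g.loc[g['weight'].idxmax(), 'C'],    # C from the row with the highest weight
        'D': g['D'].max(),                        # maximum of D
        'weight': g['weight'].sum(),              # total weight
    })

print(a.groupby('ID').apply(agg_rules))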