I summed the YTD (year-to-date) columns using this code:
df['YTD19'] = df['January 19']+df['February 19']+df['March 19']+df['April 19']+df['May 19'] + df['June 19']
df['YTD20'] = df['January 20']+df['February 20']+df['March 20']+df['April 20']+df['May 20'] + df['June 20']
But as a result, some rows (especially those with null values) did not sum correctly.
Could you please help me improve my code?
To improve your code, first replace empty or whitespace-only cells with NaN. Then, to create YTD19, sum all the columns that contain '19' in their name using filter(like=...); the same logic applies for YTD20:
import numpy as np

# replace empty strings and cells containing only spaces with NaN
df = df.replace(r'^\s*$', np.nan, regex=True)
# create your 2 columns
df['YTD19'] = df.filter(like='19').sum(axis=1)
df['YTD20'] = df.filter(like='20').sum(axis=1)
>>> df[['Manufacturer','Category','Country','YTD19','YTD20']]
Manufacturer Category Country YTD19 YTD20
0 X Joist Czech 2910 2677.0
1 Y Joist Poland 3269 2366.0
2 Z Joist Slovakia 4204 2012.0
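One caveat worth adding (an assumption about how the data was read in, not part of the original answer): if the month columns come back as text (object dtype) after the replace, it is safer to convert them to numbers before building the sums, for example:
# hedged sketch: run this before creating the YTD columns
month_cols = [c for c in df.columns if c.endswith('19') or c.endswith('20')]
df[month_cols] = df[month_cols].apply(pd.to_numeric, errors='coerce')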
I figured out what the problem was:
I forgot to replace the null values with zero, which is why the output was wrong. So the only thing needed is:
df.fillna(0)
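Note that fillna returns a new DataFrame unless you assign it back (or pass inplace=True), so a minimal sketch of the original approach with that fix applied looks like this:
df = df.fillna(0)
df['YTD19'] = df['January 19'] + df['February 19'] + df['March 19'] + df['April 19'] + df['May 19'] + df['June 19']
df['YTD20'] = df['January 20'] + df['February 20'] + df['March 20'] + df['April 20'] + df['May 20'] + df['June 20']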
Python/pandas newbie here. The csv file I'm trying to work with has been populated with data that looks something like this:
A B C D
Option1(item1=12345, item12='string', item345=0.123) 2020-03-16 1.234 Option2(item4=123, item56=234, item678=345)
I'd like it to look like this:
item1 item12 item345 B C item4 item56 item678
12345 'string' 0.123 2020-03-16 1.234 123 234 345
In other words, I want to replace columns A and D with new columns headed by what's on the left of the equal sign, using what's to the right of the equal sign as the corresponding value, and with the Option1() and Option2() parts and the commas stripped out. The columns that don't contain functions should be left as is.
Is there an elegant way to do this?
Actually, at this point, I'd settle for any old way, elegant or not; I've found various ways of dealing with this situation if, say, there were dicts populating columns, but nothing to help me pick it apart if there are functions there. Trying to search for the answer only gives me a bunch of results for how to apply functions to dataframes.
As long as your functions always have the same arguments, this should work.
You can read the CSV with the following (assuming the separators are two or more spaces, which is what I get when I paste the example from your question):
import pandas as pd

df = pd.read_csv('test.csv', sep=r'\s{2,}', index_col=False, engine='python')
If your dataframe is df:
# break out both sides of the equal sign in function into columns
A_vals = df['A'].str.extractall(r'([\w\d]+)=([^,\)]*)')
# get rid of the multi-index and put the values after '=' into columns
A_converted = A_vals.unstack(level=-1)[1]
# set column names to values before '='
A_converted.columns = list(A_vals.unstack(level=-1)[0].values[0])
# same thing for 'D'
D_vals = df['D'].str.extractall(r'([\w\d]+)=([^,\)]*)')
D_converted = D_vals.unstack(level=-1)[1]
D_converted.columns = list(D_vals.unstack(level=-1)[0].values[0])
# join everything together
df = A_converted.join(df.drop(['A','D'], axis=1)).join(D_converted)
Some clarification on the regex: r'([\w\d]+)=([^,\)]*)' has two capture groups (each part in parentheses):
Group 1 ([\w\d]+) is one or more characters (+) that are word characters \w or numbers \d.
= between groups.
Group 2 ([^,\)]*) is 0 or more characters (*) that are not (^) a comma , or paren \).
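If you want to sanity-check what the two groups capture before running it on the frame, the plain re module works on one of the sample strings (this check is just an illustration, not part of the original answer):
import re
re.findall(r'([\w\d]+)=([^,\)]*)', "Option1(item1=12345, item12='string', item345=0.123)")
# [('item1', '12345'), ('item12', "'string'"), ('item345', '0.123')]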
I believe you're looking for something along these lines:
contracts = ["Option(conId=384688665, symbol='SPX', lastTradeDateOrContractMonth='20200116', strike=3205.0, right='P', multiplier='100', exchange='SMART', currency='USD', localSymbol='SPX 200117P03205000', tradingClass='SPX')",
"Option(conId=12345678, symbol='DJX', lastTradeDateOrContractMonth='20200113', strike=1205.0, right='P', multiplier='200', exchange='SMART', currency='USD', localSymbol='DJXX 333117Y13205000', tradingClass='DJX')"]
import pandas as pd

new_conts = []
columns = []
for i in range(len(contracts)):
    mod = contracts[i].replace('Option(', '').replace(')', '')
    contracts[i] = mod
    new_cont = contracts[i].split(',')
    new_conts.append(new_cont)
for contract in new_conts:
    column = []
    for i in range(len(contract)):
        mod = contract[i].split('=')
        contract[i] = mod[1]
        column.append(mod[0].strip())  # strip the space left behind by split(',')
    columns.append(column)
print(len(columns[0]))
df = pd.DataFrame(new_conts, columns=columns[0])
df
Output:
conId symbol lastTradeDateOrContractMonth strike right multiplier exchange currency localSymbol tradingClass
0 384688665 'SPX' '20200116' 3205.0 'P' '100' 'SMART' 'USD' 'SPX 200117P03205000' 'SPX'
1 12345678 'DJX' '20200113' 1205.0 'P' '200' 'SMART' 'USD' 'DJXX 333117Y13205000' 'DJX'
Obviously you can then delete unwanted columns, change names, etc.
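For example (the column choices here are purely illustrative):
df = df.drop(columns=['multiplier']).rename(columns={'conId': 'contract_id'})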
I have a table with models in df2['model'], and
pd.value_counts(df2['model'].values, sort=True)
returns this:
MONSTER 331
MULTISTRADA 134
HYPERMOTARD 69
SCRAMBLER 63
SUPERSPORT 31
...
900 1
T-MAX 1
FC 1
GTS 1
SCOUT 1
Length: 75, dtype: int64
I want to rename all the values in df2['model'] that have a count < 5 to 'OTHER'.
Can anyone please help me figure out how to go about this?
You can first get a list of the categories you want to change to 'OTHER' with the first line of code below. It takes your value counts and selects the entries which meet the condition you want (in this case, fewer than 5 occurrences).
Then you select the rows of the dataframe whose model cell is in that list of categories and change the value to 'OTHER'.
other_classes = data['model'].value_counts()[data['model'].value_counts() < 5].index
data.loc[data['model'].isin(other_classes), 'model'] = 'OTHER'  # .loc avoids chained-assignment issues
Hope it helps
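A tiny self-contained check of the idea (the sample data here is made up):
import pandas as pd
data = pd.DataFrame({'model': ['MONSTER'] * 6 + ['T-MAX', 'SCOUT']})
other_classes = data['model'].value_counts()[data['model'].value_counts() < 5].index
data.loc[data['model'].isin(other_classes), 'model'] = 'OTHER'
print(data['model'].value_counts())
# MONSTER    6
# OTHER      2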
I suspect it is not at all elegant or pythonic, but this worked in the end:
df_pooled_other = df_final.assign(freq=df_final.groupby('model name')['model name'].transform('count'))\
.sort_values(by=['freq','model name', 'Age in months_x_x'],ascending=[False,True, True])
df_pooled_other['model name'] = np.where(df_pooled_other['freq'] <= 5, 'Other', df_pooled_other['model name'])
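A small reproduction of the same transform/np.where idea on made-up data (the column and model names here are hypothetical):
import numpy as np
import pandas as pd
df_small = pd.DataFrame({'model name': ['MULTISTRADA'] * 7 + ['GTS', 'FC']})
freq = df_small.groupby('model name')['model name'].transform('count')
df_small['model name'] = np.where(freq <= 5, 'Other', df_small['model name'])
print(df_small['model name'].value_counts())
# MULTISTRADA    7
# Other          2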
Let's say there is a column like the one below.
df = pd.DataFrame(['A-line B-station 9-min C-station 3-min',
'D-line E-station 8-min F-line G-station 5-min',
'G-line H-station 1-min I-station 6-min J-station 8-min'],
columns=['station'])
A, B, C are just arbitrary characters, and there are a whole bunch of rows like this:
station
0 A-line B-station 9-min C-station 3-min
1 D-line E-station 8-min F-line G-station 5-min
2 G-line H-station 1-min I-station 6-min J-stati...
How can we make columns like below?
Line1 Station1-1 Station1-2 Station1-3 Line2 Station2-1
0 A-line B-station C-station null null null
1 D-line E-station null null F-line G-station
2 G-line H-station I-station J-station null null
StationX-Y means the Y-th station on line X:
Station1-1 means first station for first line(line1)
Station1-2 means second station for first line(line1)
Station2-1 means first station for second line(line2)
I tried to split by delimiter; however, it doesn't work since every row has a different number of lines and stations.
What I probably need is to split into columns based on the characters the values contain. For example, I could store the first '-line' value in Line1 and the first '-station' value in Station1-1.
Does anybody have any ideas how to do this?
Any thoughts, however small, would help me!
Thank you!
First create a Series with Series.str.split and DataFrame.stack:
s = df['station'].str.split(expand=True).stack()
Then remove values ending with 'min' by boolean indexing with Series.str.endswith:
df1 = s[~s.str.endswith('min')].to_frame('data').rename_axis(('a','b'))
Then create counters for lines and for station rows with filtering and GroupBy.cumcount:
df1['Line'] = (df1[df1['data'].str.endswith('line')]
                  .groupby(level=0)
                  .cumcount()
                  .add(1)
                  .astype(str))
df1['Line'] = df1['Line'].ffill()
df1['station'] = (df1[df1['data'].str.endswith('station')]
                     .groupby(['a','Line'])
                     .cumcount()
                     .add(1)
                     .astype(str))
Create the station labels by joining the two counters with '-', filling missing values (the line rows) from df1['Line'] with Series.fillna:
df1['station'] = (df1['Line'] + '-' + df1['station']).fillna(df1['Line'])
Reshape by DataFrame.set_index with DataFrame.unstack:
df1 = df1.set_index('station', append=True)['data'].reset_index(level=1, drop=True).unstack()
Rename the column names - not earlier, to avoid the columns being sorted in the wrong order:
df1 = df1.rename(columns = lambda x: 'Station' + x if '-' in x else 'Line' + x)
Remove the columns and index names:
df1.columns.name = None
df1.index.name = None
print (df1)
Line1 Station1-1 Station1-2 Station1-3 Line2 Station2-1
0 A-line B-station C-station NaN NaN NaN
1 D-line E-station NaN NaN F-line G-station
2 G-line H-station I-station J-station NaN NaN
I have a dataframe which has some duplicate tags separated by commas in the "Tags" column. Is there a way to remove the duplicate strings from the series? I want the output in row 400 to have just Museum, Drinking, Shopping.
I can't just split on a comma and drop repeated words, because some tags in the series share words, for example [Museum, Art Museum, Shopping]; splitting and dropping the repeated 'Museum' strings would also affect the distinct 'Art Museum' tag.
Desired Output
You can split on the comma and convert to a set(), which removes duplicates, after stripping leading/trailing whitespace with str.strip(). Then you can apply() this to your Tags column (note that a set does not preserve the original tag order):
df['Tags']=df['Tags'].apply(lambda x: ', '.join(set([y.strip() for y in x.split(',')])))
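A quick check of the one-liner on made-up data:
import pandas as pd
df = pd.DataFrame({'Tags': ['Museum, Drinking, Drinking, Shopping']})
df['Tags'] = df['Tags'].apply(lambda x: ', '.join(set(y.strip() for y in x.split(','))))
print(df['Tags'].iloc[0])  # e.g. 'Shopping, Museum, Drinking'; the order may vary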
You can create a function that removes duplicates from a given string. Then apply this function to your column Tags.
def remove_dup(strng):
    '''
    Split the tag string on ", " and rebuild it without duplicates
    (dict.fromkeys preserves the original order).
    '''
    return ', '.join(list(dict.fromkeys(strng.split(', '))))
df['Tags'] = df['Tags'].apply(lambda x: remove_dup(x))
DEMO:
import pandas as pd
my_dict = {'Tags':["Museum, Art Museum, Shopping, Museum",'Drink, Drink','Shop','Visit'],'Country':['USA','USA','USA', 'USA']}
df = pd.DataFrame(my_dict)
df['Tags'] = df['Tags'].apply(lambda x: remove_dup(x))
df
Output:
Tags Country
0 Museum, Art Museum, Shopping USA
1 Drink USA
2 Shop USA
3 Visit USA
Without a code example to work from, I've thrown together something that should work.
import pandas as pd
test = [['Museum', 'Art Museum', 'Shopping', "Museum"]]
df = pd.DataFrame()
df[0] = test
df[0]= df.applymap(set)
Out[35]:
0
0 {Museum, Shopping, Art Museum}
One approach that avoids apply:
# in your code just s = df['Tags']
s = pd.Series(['','', 'Tour',
'Outdoors, Beach, Sports',
'Museum, Drinking, Drinking, Shopping'])
(s.str.split(r',\s+', expand=True)
   .stack()
   .reset_index()
   .drop_duplicates(['level_0', 0])
   .groupby('level_0')[0]
   .agg(','.join)
)
Output:
level_0
0
1
2 Tour
3 Outdoors,Beach,Sports
4 Museum,Drinking,Shopping
Name: 0, dtype: object
There may be much fancier ways of doing this kind of thing, but this will do the job.
Make the tags lower-case:
data['tags'] = data['tags'].str.lower()
Split every row in the tags column on the comma; this returns a list of strings:
data['tags'] = data['tags'].str.split(',')
Map str.strip over every element of each list (removing leading/trailing spaces), then apply set to the result, which removes the duplicates:
data['tags'] = data['tags'].apply(lambda x: set(map(str.strip , x)))
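If you need the column back as a single comma-separated string (as in the question) rather than a set, you can join it afterwards; this step is not part of the original answer, and remember that the order of a set is arbitrary:
data['tags'] = data['tags'].apply(lambda x: ', '.join(x))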
I have a column in a pandas dataframe which has items like the following:
SubBrand
Sam William Mathew
Jonty Rodes
Chris Gayle
I want to create a new column (SubBrand_new) like this:
SubBrand_new
0 SWM
1 JR
2 CG
I am using this piece of code,
df1["SubBrand_new"] = "".join([x[0] for x in (df1["SubBrand"].str.split())])
but I am not able to get what I am looking for. Can anybody help?
We can do split with expand=True, take the first character of each piece, and sum along the row, i.e.:
df['SubBrand'].str.split(expand=True).apply(lambda x : x.str[0]).fillna('').sum(1)
0 SWM
1 JR
2 CG
dtype: object
You want to apply a function to every row and return a new column with its result. This kind of operation can be done with the .apply() method; a simple = assignment will not do the trick. A solution in the spirit of your code would be:
df = pd.DataFrame({'Name': ['Marcus Livius Drussus',
'Lucius Cornelius Sulla',
'Gaius Julius Caesar']})
df['Abrev'] = df.Name.apply(lambda x: "".join([y[0] for y in (x.split())]))
Which yields
df
Name Abrev
0 Marcus Livius Drussus MLD
1 Lucius Cornelius Sulla LCS
2 Gaius Julius Caesar GJC
EDIT:
I compared it to the other solution, thinking that the apply() method with join() would be pretty slow. I was surprised to find that it is in fact faster. Setting:
N = 3000000
bank = pd.util.testing.rands_array(3,N)
vec = [bank[3*i] + ' ' + bank[3*i+1] + ' ' + bank[3*i+2] for i in range(N // 3)]
df = pd.DataFrame({'Name': vec})
I find:
df.Name.apply(lambda x: "".join([y[0] for y in (x.split())]))
executed in 581ms
df.Name.str.split(expand=True).apply(lambda x : x.str[0]).fillna('').sum(1)
executed in 2.81s
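If you want to reproduce the comparison yourself, a simple harness along these lines works (timings will vary with machine and pandas version; this is only a sketch):
import timeit
t_apply = timeit.timeit(lambda: df.Name.apply(lambda x: "".join(y[0] for y in x.split())), number=1)
t_split = timeit.timeit(lambda: df.Name.str.split(expand=True).apply(lambda x: x.str[0]).fillna('').sum(1), number=1)
print(f"apply/join: {t_apply:.2f}s   split/expand: {t_split:.2f}s")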