I have a dataframe that I created by hand. I am working on code that copies the dataframe and concatenates the new dataframe onto the end of the first one. For now, I need the code to look through each value of the 'Name' column, which contains strings, and if there is a number in the string, increase that number by 1. I need the number converted to an int so that I can write a function that looks through the dataframe and automatically adds 1 to the largest number it finds. An example:
import pandas as pd
import re

data = {'ID': [1, 2, 3, 4],
        'Name': ['BN #1', 'HHC', 'A comp', 'B Comp']}
df = pd.DataFrame(data)
df['SysNum'] = [int(re.search(r'(?<=#)\d', x)[0]) for x in df['Name'].values]
Afterwards, the new df should look like this:
data2 = {'ID': [1, 2, 3, 4, 5, 6, 7, 8],
         'Name': ['BN #1', 'HHC', 'A comp', 'B Comp', 'BN #2', 'HHC', 'A comp', 'B Comp']}
When I run this, I receive a "'NoneType' object is not subscriptable" error. This makes sense, because only the 'BN #1' row has a number and re.search returns None when the pattern does not match, but I cannot figure out how to tell Python to ignore the other rows.
EDIT
Only the first row of each dataframe will increase by 1, so if there is an easier way that does not use re.search, that is fine. I know there are a couple of ways of doing this, but I want to be able to always look through the string value of the BN row and increase it by 1 every time I run the code.
REGEX EDIT
import re
import numpy as np

df2['BaseName'] = [re.sub(r'\d', '', x) for x in df2['Name'].values]
df['BaseName'] = [re.sub(r'\d', '', x) for x in df['Name'].values]

df2['SysNum'] = [int(re.search(r'(?<=#)\d', x)[0]) for x in df2['Name'].values]
# df2['SysNum'] = df2['Name'].get(r'(?<=#)\d').astype(int)
# df['SysNum'] = [int(re.search(r'(?<=#)\d', x)[0]) for x in df['Name'].values]
df['SysNum'] = df['Name'].str.contains(r'(?<=#)\d').astype(int)

m = re.search(r'(?<=#)\d', df2['Name'].iloc[0])
if m:
    df2['SysNum'] = int(m.group(0)) + 1
n = re.search(r'(?<=#)\d', df['Name'].iloc[0])
if n:
    df['SysNum'] = int(n.group(0)) + 1

new_names = df2['BaseName'].unique()
maxes2 = np.zeros((len(new_names),))
for j in range(len(new_names)):
    un2 = new_names[j]
    maxes2[j] = df['SysNum'].loc[df['BaseName'] == un2].max()
    sel = df2['BaseName'] == un2
    df2['SysNum'].loc[sel] = np.linspace(1, sel.sum(), sel.sum())
    df2['SysNum'].loc[sel] += maxes2[j]
    newnames2 = [s + '%d' % num for s, num in zip(df2['BaseName'].loc[sel].values,
                                                  df2['SysNum'].loc[sel].values)]
    df2['Name'].loc[sel] = newnames2
I have this code working for two dataframes, and the numbering works out how I would like it to. Those first two dataframes use a "Name-###" naming convention for all of their rows, which lets the commented-out re.search lines run just fine. The next two dataframes I am working on are like the examples above: only the 'BN #1' row has a number, and the rest of the names do not. When I run the commented-out re.search lines on those, the code tries to convert the NoneTypes to int and fails. When I run the code as it is now, a new number is put on each and every row immediately following the name, but I only need it to add a new number to the row with the # sign. So what I am struggling with is code that: looks through the dataframe for a # sign, turns the number after the # sign into an int, finds the max of those ints and adds 1 to it, writes that new number into the new dataframe, and concatenates the new dataframe onto the old one to build a larger master list.
You can access the value on the first row of the Name column using df['Name'].iloc[0].
Thus, you can search for a sequence of digits after a # sign in that value using
m = re.search(r'#(\d+)', df['Name'].iloc[0])
if m:
    df['SysNum'] = int(m.group(1)) + 1
Output:
>>> df
   ID    Name  SysNum
0   1   BN #1       2
1   2     HHC       2
2   3  A comp       2
3   4  B Comp       2
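If you also need the copy-and-concatenate step from the question, here is a minimal sketch of one way to do it. This is an illustration, not the poster's code: it assumes, as in the example data, that only the 'BN #...' row carries a number.

import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 4],
                   'Name': ['BN #1', 'HHC', 'A comp', 'B Comp']})

# Extract the digits after '#' where present; rows without a '#' become NaN.
nums = df['Name'].str.extract(r'#(\d+)', expand=False).astype(float)
next_num = int(nums.max()) + 1  # largest number found, plus one

# Copy the frame, bump the numbered name, and append the copy to the original.
df2 = df.copy()
df2['Name'] = df2['Name'].str.replace(r'#\d+', '#%d' % next_num, regex=True)
df2['ID'] = df2['ID'] + len(df)
master = pd.concat([df, df2], ignore_index=True)

This reproduces the data2 layout shown earlier, with 'BN #2' in the second copy.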
I have a file full of URL paths like below spanning across 4 columns in a dataframe that I am trying to clean:
Path1 = ["https://contentspace.global.xxx.com/teams/Australia/WA/Documents/Forms/AllItems.aspx?\
RootFolder=%2Fteams%2FAustralia%2FWA%2FDocuments%2FIn%20Scope&FolderCTID\
=0x012000EDE8B08D50FC3741A5206CD23377AB75&View=%7B287FFF9E%2DD60C%2D4401%2D9ECD%2DC402524F1D4A%7D"]
I want to remove everything after a specific string, which I defined as "string1", and I would like to loop through all 4 columns in the dataframe, defined as "df_MasterData":
string1 = "&FolderCTID"
import pandas as pd
df_MasterData = pd.read_excel(FN_MasterData)
cols = ['Column_A', 'Column_B', 'Column_C', 'Column_D']
for i in cols:
# Objective: Replace "&FolderCTID", delete all string after
string1 = "&FolderCTID"
# Method 1
df_MasterData[i] = df_MasterData[i].str.split(string1).str[0]
# Method 2
df_MasterData[i] = df_MasterData[i].str.split(string1).str[1].str.strip()
# Method 3
df_MasterData[i] = df_MasterData[i].str.split(string1)[:-1]
I did search on Google and found similar solutions, but none of them worked.
Can any guru shed some light on this? Any assistance is appreciated.
Added below are a few example rows from columns A and B for these URLs:
Column_A = ['https://contentspace.global.xxx.com/teams/Australia/NSW/Documents/Forms/AllItems.aspx?\
RootFolder=%2Fteams%2FAustralia%2FNSW%2FDocuments%2FIn%20Scope%2FA%20I%20TOPPER%20GROUP&FolderCTID=\
0x01200016BC4CE0C21A6645950C100F37A60ABD&View=%7B64F44840%2D04FE%2D4341%2D9FAC%2D902BB54E7F10%7D',\
'https://contentspace.global.xxx.com/teams/Australia/Victoria/Documents/Forms/AllItems.aspx?RootFolder\
=%2Fteams%2FAustralia%2FVictoria%2FDocuments%2FIn%20Scope&FolderCTID=0x0120006984C27BA03D394D9E2E95FB\
893593F9&View=%7B3276A351%2D18C1%2D4D32%2DADFF%2D54158B504FCC%7D']
Column_B = ['https://contentspace.global.xxx.com/teams/Australia/WA/Documents/Forms/AllItems.aspx?\
RootFolder=%2Fteams%2FAustralia%2FWA%2FDocuments%2FIn%20Scope&FolderCTID=0x012000EDE8B08D50FC3741A5\
206CD23377AB75&View=%7B287FFF9E%2DD60C%2D4401%2D9ECD%2DC402524F1D4A%7D',\
'https://contentspace.global.xxx.com/teams/Australia/QLD/Documents/Forms/AllItems.aspx?RootFolder=%\
2Fteams%2FAustralia%2FQLD%2FDocuments%2FIn%20Scope%2FAACO%20GROUP&FolderCTID=0x012000E689A6C1960E8\
648A90E6EC3BD899B1A&View=%7B6176AC45%2DC34C%2D4F7C%2D9027%2DDAEAD1391BFC%7D']
This is how I would do it:
First, declare a variable with your target columns.
Then use stack() and str.split to get your target output.
Finally, unstack and reassign the output to your original df.

cols_to_slice = ['Column_A', 'Column_B', 'Column_C', 'Column_D']
string1 = "&FolderCTID"
df_MasterData[cols_to_slice].stack().str.split(string1, expand=True)[0].unstack(1)

If you want to replace these columns in your target df, then simply do:

df_MasterData[cols_to_slice] = df_MasterData[cols_to_slice].stack().str.split(string1, expand=True)[0].unstack(1)
You should first get the index of the string using str.find:

# end position of string1 in each cell (this keeps string1 in the result)
indexes = len(string1) + df_MasterData[i].str.find(string1)
# if you don't want string1 in the result, use the start position instead
indexes = df_MasterData[i].str.find(string1)

Since indexes is a Series (one position per row), slice element-wise:

df_MasterData[i] = [s[:idx] for s, idx in zip(df_MasterData[i], indexes)]
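As a quick sanity check on the prefix-keeping idea, here is a minimal sketch using a shortened stand-in URL rather than the full Path1 value:

import pandas as pd

string1 = "&FolderCTID"
s = pd.Series(["https://contentspace.global.xxx.com/teams/Australia/WA/Documents/Forms/AllItems.aspx?RootFolder=X&FolderCTID=0x012000&View=Y"])

# keep everything before string1
print(s.str.split(string1).str[0].iloc[0])
# https://contentspace.global.xxx.com/teams/Australia/WA/Documents/Forms/AllItems.aspx?RootFolder=X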
I have a large .csv file that has 11,000,000 rows and 3 columns: id, magh, mixid2.
What I have to do is select the rows with the same id and then check whether these rows have the same mixid2; if they do, I remove the rows, and if they don't, I initialize a class with the information of the selected rows.
This is my code:
import numpy as np
from tqdm import tqdm

obs = obs.set_index('id')
obs = obs.sort_index()

# dropping elements with only one mixid2 and filling S
ID = obs.index.unique()
S = []
good_bye_list = []
for i in tqdm(ID):
    app = obs.loc[i]
    if len(np.unique(app['mixid2'])) != 1:
        # fill the class list ('star' and 'z_in' are defined elsewhere in the script)
        S.append(star(app['magh'].values, app['mixid2'].values, z_in))
    else:
        # drop
        good_bye_list.append(i)
obs = obs.drop(good_bye_list)
The .csv file is very large, so it takes 40 minutes to compute everything.
How can I improve the speed?
Thank you for the help.
This is the .csv file:
id,mixid2,magh
3447001203296326,557,14.25
3447001203296326,573,14.25
3447001203296326,525,14.25
3447001203296326,541,14.25
3447001203296330,540,15.33199977874756
3447001203296330,573,15.33199977874756
3447001203296333,172,17.476999282836914
3447001203296333,140,17.476999282836914
3447001203296333,188,17.476999282836914
3447001203296333,156,17.476999282836914
3447001203296334,566,15.626999855041506
3447001203296334,534,15.626999855041506
3447001203296334,550,15.626999855041506
3447001203296338,623,14.800999641418455
3447001203296338,639,14.800999641418455
3447001203296338,607,14.800999641418455
3447001203296344,521,12.8149995803833
3447001203296344,537,12.8149995803833
3447001203296344,553,12.8149995803833
3447001203296345,620,12.809000015258787
3447001203296345,543,12.809000015258787
3447001203296345,636,12.809000015258787
3447001203296347,558,12.315999984741213
3447001203296347,542,12.315999984741213
3447001203296347,526,12.315999984741213
3447001203296352,615,12.11299991607666
3447001203296352,631,12.11299991607666
3447001203296352,599,12.11299991607666
3447001203296360,540,16.926000595092773
3447001203296360,556,16.926000595092773
3447001203296360,572,16.926000595092773
3447001203296360,524,16.926000595092773
3447001203296367,490,15.80799961090088
3447001203296367,474,15.80799961090088
3447001203296367,458,15.80799961090088
3447001203296369,639,15.175000190734865
3447001203296369,591,15.175000190734865
3447001203296369,623,15.175000190734865
3447001203296369,607,15.175000190734865
3447001203296371,460,14.975000381469727
3447001203296373,582,14.532999992370605
3447001203296373,614,14.532999992370605
3447001203296373,598,14.532999992370605
3447001203296374,184,14.659000396728516
3447001203296374,203,14.659000396728516
3447001203296374,152,14.659000396728516
3447001203296374,136,14.659000396728516
3447001203296374,168,14.659000396728516
3447001203296375,592,14.723999977111815
3447001203296375,608,14.723999977111815
3447001203296375,624,14.723999977111815
3447001203296375,92,14.723999977111815
3447001203296375,76,14.723999977111815
3447001203296375,108,14.723999977111815
3447001203296375,576,14.723999977111815
3447001203296376,132,14.0649995803833
3447001203296376,164,14.0649995803833
3447001203296376,180,14.0649995803833
3447001203296376,148,14.0649995803833
3447001203296377,168,13.810999870300293
3447001203296377,152,13.810999870300293
3447001203296377,136,13.810999870300293
3447001203296377,184,13.810999870300293
3447001203296378,171,13.161999702453613
3447001203296378,187,13.161999702453613
3447001203296378,155,13.161999702453613
3447001203296378,139,13.161999702453613
3447001203296380,565,13.017999649047852
3447001203296380,517,13.017999649047852
3447001203296380,549,13.017999649047852
3447001203296380,533,13.017999649047852
3447001203296383,621,13.079999923706055
3447001203296383,589,13.079999923706055
3447001203296383,605,13.079999923706055
3447001203296384,541,12.732000350952148
3447001203296384,557,12.732000350952148
3447001203296384,525,12.732000350952148
3447001203296385,462,12.784000396728516
3447001203296386,626,12.663999557495115
3447001203296386,610,12.663999557495115
3447001203296386,577,12.663999557495115
3447001203296389,207,12.416000366210938
3447001203296389,255,12.416000366210938
3447001203296389,223,12.416000366210938
3447001203296389,239,12.416000366210938
3447001203296390,607,12.20199966430664
3447001203296390,591,12.20199966430664
3447001203296397,582,16.635000228881836
3447001203296397,598,16.635000228881836
3447001203296397,614,16.635000228881836
3447001203296399,630,17.229999542236328
3447001203296404,598,15.970000267028807
3447001203296404,631,15.970000267028807
3447001203296404,582,15.970000267028807
3447001203296408,540,16.08799934387207
3447001203296408,556,16.08799934387207
3447001203296408,524,16.08799934387207
3447001203296408,572,16.08799934387207
3447001203296409,632,15.84000015258789
3447001203296409,616,15.84000015258789
Hello and welcome to StackOverflow.
In pandas, the rule of thumb is that raw loops are always slower than the dedicated functions. To apply a function to a sub-DataFrame of rows that fulfill certain criteria, you can use groupby.
In your case the function is a bit... unpythonic, as appending to S is a side effect and deleting rows you are currently iterating over is dangerous (with a dictionary, for example, you should never do this). That said, you can create a function like this:
In [37]: def my_func(df):
    ...:     if df['mixid2'].nunique() == 1:
    ...:         return None
    ...:     else:
    ...:         S.append(df['mixid2'])
    ...:         return df
and apply it to your DataFrame via
S = []
obs.groupby('id').apply(my_func)
This iterates over all sub-DataFrames with the same id and drops them if there is exactly one unique value in mixid2; otherwise it appends the values to the list S.
The resulting DataFrame is 3 rows shorter:
Out[38]:
                                       id  mixid2       magh
id
3447001203296326 0       3447001203296326     557  14.250000
                 1       3447001203296326     573  14.250000
...                                   ...     ...        ...
3447001203296409 98      3447001203296409     632  15.840000
                 99      3447001203296409     616  15.840000

[97 rows x 3 columns]
and S contains 28 elements, which you could pass into the star constructor just as you did.
I guess you want to group by mixid2 and, using set_index, exclude all the elements where mixid2 appears only once. To get back the original shape, we use reset_index after the filtering.

df = obs.set_index('mixid2').loc[~obs.groupby('mixid2').count().id.eq(1)].reset_index()
df.shape
(44, 3)
I'm not entirely sure if I understood you correctly, but what you can do is first remove the duplicates in your dataframe and then use the groupby function to get all the remaining data points with the same id:

# drop all duplicates based on id and mixid2
df.drop_duplicates(["id", "mixid2"], inplace=True)

# then iterate over all groups:
for index, grp in df.groupby(["id"]):
    pass  # do stuff with the grp here
Normally it is a good idea to rely on pandas internal functions, since they are mostly optimised quite well.
new_df = obs.groupby(['id', 'mixid2'], as_index=False).agg('count')
new_df = new_df[new_df['magh'] > 1]

Then pass new_df to your function.
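For completeness, here is a vectorized sketch of the whole filtering step (assuming obs is the DataFrame read from the .csv above; the file name is hypothetical):

import pandas as pd

obs = pd.read_csv('obs.csv')  # hypothetical path to the data shown above

# number of distinct mixid2 values per id, broadcast back to every row
nuniq = obs.groupby('id')['mixid2'].transform('nunique')
obs = obs[nuniq > 1]  # keep only ids whose rows do not all share one mixid2

groupby().transform('nunique') avoids the Python-level loop entirely, which is usually the main cost at 11 million rows.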
I have a DataFrame with a column that contains integers and sometimes a string of multiple comma-separated numbers (like "1234567, 89012345, 65425774").
I want to convert such a string into a list of integers so it is easier to search for specific numbers.
In [1]: import pandas as pd
In [2]: raw_input = "1111111111 666 10069759 9695011 9536391,2261003 9312405 15542804 15956127 8409044 9663061 7104622 3273441 3336156 15542815 15434808 3486259 8469323 7124395 15956159 3319393 15956184
: 15956217 13035908 3299927"
In [3]: df = pd.DataFrame({'x':raw_input.split()})
In [4]: df.head()
Out[4]:
                 x
0       1111111111
1              666
2         10069759
3          9695011
4  9536391,2261003
Since your column contains both strings and integers, you probably want something like this:

def to_integers(column_value):
    if not isinstance(column_value, int):
        return [int(v) for v in column_value.split(',')]
    else:
        return column_value

df.loc[:, 'column_name'] = df.loc[:, 'column_name'].apply(to_integers)
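A quick usage sketch for the searching part (assuming the 'x' column from the question; every cell there is a string, so to_integers turns each one into a list of ints):

target = 2261003
lists = df['x'].apply(to_integers)
# handles both list cells and plain int cells
mask = lists.apply(lambda v: target in v if isinstance(v, list) else v == target)
print(df[mask])  # matches row 4: 9536391,2261003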
Your best solution to cases like this, where a column has 1 or more values, is splitting the data into multiple columns.
Try something like

ids = df.ID.str.split(',', expand=True)
for i in range(ids.shape[1]):
    df['ID' + str(i + 1)] = ids.iloc[:, i]
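With the values split into columns, a membership search becomes a row-wise any() (a sketch; the ID1, ID2, ... names come from the loop above, and the split values are still strings):

target = '2261003'
id_cols = ['ID' + str(i + 1) for i in range(ids.shape[1])]
mask = df[id_cols].eq(target).any(axis=1)
print(df[mask])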
I am trying to count the number of times that any string from a list_of_strings appears in a csv file cell.
For example, the following would work fine.
import pandas as pd

data_path = "SurveryResponses.csv"
df = pd.read_csv(data_path)

totalCount = 0
for row in df['rowName']:
    if type(row) == str:
        print(row.count('word_of_interest'))
However, I would like to be able to enter a list of strings (['str1', 'str2', 'str3']) rather than just one 'word_of_interest', such that if any of those strings appears, the count value will increase by one.
Is there a way to do this?
Perhaps something along the lines of

totalCount = 0
words_of_interest = ['cat', 'dog', 'foo', 'bar']
for row in df['rowName']:
    if type(row) == str:
        if sum([word in row for word in words_of_interest]) > 0:
            totalCount += 1
Use the str accessor:
df['rowName'].str.count('word_of_interest')
If you need to convert the column to string first, use astype:
df['rowName'].astype(str).str.count('word_of_interest')
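To extend this to a list of words, one option (a sketch, assuming the words contain no regex metacharacters; otherwise wrap each in re.escape) is to join them into a single alternation pattern:

words_of_interest = ['cat', 'dog', 'foo', 'bar']
pattern = '|'.join(words_of_interest)  # 'cat|dog|foo|bar'
# number of rows containing at least one of the words
total = df['rowName'].astype(str).str.contains(pattern).sum()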
Assuming list_of_strings = ['str1', 'str2', 'str3'], you can try the following:

if any(map(lambda x: x in row, list_of_strings)):
    totalCount += 1
You can use this method to count from an external list:

strings = ['string1', 'string2', 'string3']
sum([1 if sr in strings else 0 for sr in df.rowName])
Here is an example:

import io
import pandas as pd

filedata = """animal,amount
"['cat','dog']",2
"['cat','horse']",2"""

df = pd.read_csv(io.StringIO(filedata))
Returns this dataframe:
            animal  amount
0    ['cat','dog']       2
1  ['cat','horse']       2
Search for the word cat (looping over all columns as Series):

search = "cat"
# sum the True values for each series,
# then wrap a sum around all the sums: sum([2, 0]) in this case
sum([sum(df[cols].astype(str).str.contains(search)) for cols in df.columns])
Returns 2