How do I remove non-alphabetic characters from the values in a dataframe? I have only managed to convert everything to lower case.
def doubleAwardList(self):
    dfwinList = pd.DataFrame()
    dfloseList = pd.DataFrame()
    dfwonandLost = pd.DataFrame()
    # self.dfWIN... and self.dfLOSE... are just the functions used to call the files chosen by the user
    groupby_name = self.dfWIN.groupby("name")
    groupby_nameList = self.dfLOSE.groupby("name _List")
    list4 = []
    list5 = []
    notAwarded = "na"
    for x, group in groupby_name:
        if x != notAwarded:
            list4.append(str(x).lower())
    dfwinList = pd.DataFrame(list4)
    for x, group in groupby_nameList:
        list5.append(str(x).lower())
    dfloseList = pd.DataFrame(list5)
Data sample below. I mainly need to remove the full stops and hyphens, because I have to compare the values against another file whose naming isn't very consistent, so removing the non-alphanumeric characters gives a much more accurate result.
creative-3
smart tech pte. ltd.
nutritive asia
asia's first
desired result:
creative 3
smart tech pte ltd
nutritive asia
asia s first
Use DataFrame.replace only, and add a whitespace to the pattern so existing spaces are kept:
df = df.replace('[^a-zA-Z0-9 ]', '', regex=True)
If it is one column (a Series):
df = pd.DataFrame({'col': ['creative-3', 'smart tech pte. ltd.',
'nutritive asia', "asia's first"],
'col2':range(4)})
print (df)
col col2
0 creative-3 0
1 smart tech pte. ltd. 1
2 nutritive asia 2
3 asia's first 3
df['col'] = df['col'].replace('[^a-zA-Z0-9 ]', '', regex=True)
print (df)
col col2
0 creative3 0
1 smart tech pte ltd 1
2 nutritive asia 2
3 asias first 3
EDIT:
If there are multiple columns, it is possible to select only the object columns (i.e., the string columns) and, if necessary, cast them to strings first:
cols = df.select_dtypes('object').columns
print (cols)
Index(['col'], dtype='object')
df[cols] = df[cols].astype(str).replace('[^a-zA-Z0-9 ]', '', regex=True)
print (df)
col col2
0 creative3 0
1 smart tech pte ltd 1
2 nutritive asia 2
3 asias first 3
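Note that the asker's desired output replaces each punctuation character with a space ('creative 3', 'asia s first') rather than deleting it. A minimal sketch of that variant, reusing the sample frame above: substitute a space instead of the empty string, lowercase, then collapse any doubled whitespace:
df['col'] = (df['col'].replace('[^a-zA-Z0-9 ]', ' ', regex=True)
                      .str.lower()
                      .str.replace(r'\s+', ' ', regex=True)
                      .str.strip())
# col is now: creative 3, smart tech pte ltd, nutritive asia, asia s first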
Why not just the below? (I did make it lowercase too, by the way.) Note that .str works on a Series, so this form applies when df is a single column:
df = df.replace('[^a-zA-Z0-9]', '', regex=True).str.lower()
Then:
print(df)
gives the desired dataframe.
Update: for a whole dataframe, try:
df = df.apply(lambda x: x.str.replace('[^a-zA-Z0-9]', '', regex=True).str.lower(), axis=0)
If it is only one column, do:
df['your col'] = df['your col'].str.replace('[^a-zA-Z0-9]', '', regex=True).str.lower()
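One caveat, as an assumption about the asker's goal: the pattern [^a-zA-Z0-9] also removes spaces, turning 'smart tech pte. ltd.' into 'smarttechpteltd'. Keep a space inside the character class if spaces should survive:
df['your col'] = df['your col'].str.replace('[^a-zA-Z0-9 ]', '', regex=True).str.lower()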
Let's say that I have this dataframe with four columns: "Name", "Value", "Ccy" and "Group":
import pandas as pd
Name = ['ID', 'Country', 'IBAN','Dan_Age', 'Dan_city', 'Dan_country', 'Dan_sex', 'Dan_Age', 'Dan_country','Dan_sex' , 'Dan_city','Dan_country' ]
Value = ['TAMARA_CO', 'GERMANY','FR56','18', 'Berlin', 'GER', 'M', '22', 'FRA', 'M', 'Madrid', 'ESP']
Ccy = ['','','','EUR','EUR','USD','USD','','CHF', '','DKN','']
Group = ['0','0','0','1','1','1','1','2','2','2','3','3']
df = pd.DataFrame({'Name':Name, 'Value' : Value, 'Ccy' : Ccy,'Group':Group})
print(df)
Name Value Ccy Group
0 ID TAMARA_CO 0
1 Country GERMANY 0
2 IBAN FR56 0
3 Dan_Age 18 EUR 1
4 Dan_city Berlin EUR 1
5 Dan_country GER USD 1
6 Dan_sex M USD 1
7 Dan_Age 22 2
8 Dan_country FRA CHF 2
9 Dan_sex M 2
10 Dan_city Madrid DKN 3
11 Dan_country ESP 3
I want to represent this data differently before saving it to a CSV. I would like to group the duplicates in the column "Name" with the associated values in "Value" and "Ccy". I want the data in the columns "Value" and "Ccy" to be stored in the row (index) defined by the column "Group", so that I do not mix the data.
Then, if the name is in "Group" 0, it means that it is general data, so I would like all the rows for this "Name" to be filled with the same value.
So I would like to get this result :
ID_Value Country_Value IBAN_Value Dan_age Dan_age_Ccy Dan_city_Value Dan_city_Ccy Dan_sex_Value
1 TAMARA GER FR56 18 EUR Berlin EUR M
2 TAMARA GER FR56 22 M
3 TAMARA GER FR56 Madrid DKN
I cannot find how to do the first part. With the code below, I do not get what I want, even if I remove the empty columns:
g = df.groupby(['Name']).cumcount()
df = df.set_index([g,'Name']).unstack().sort_index(level=1, axis=1)
df.columns = df.columns.map(lambda x: f'{x[0]}_{x[1]}')
Can anyone help me? Thank you!
You can use the following. See comments in code for each step:
import numpy as np # needed for np.nan in the cleanup step below

s = df.loc[df['Group'] == '0', 'Name'].tolist() # this variable will be used later according to Condition 2
df['Name'] = pd.Categorical(df['Name'], categories=df['Name'].unique(), ordered=True) #this preserves order before pivoting
df = df.pivot(index='Group', columns='Name') #transforms long-to-wide per expected output
for col in df.columns:
if col[1] in s: df[col] = df[col].shift().ffill() #Condition 2
df = df.iloc[1:].replace('',np.nan).dropna(axis=1, how='all').fillna('') #dataframe cleanup
df.columns = ['_'.join(col) for col in df.columns.swaplevel()] #column name cleanup
df
Out[1]:
ID_Value Country_Value IBAN_Value Dan_Age_Value Dan_city_Value \
Group
1 TAMARA_CO GERMANY FR56 18 Berlin
2 TAMARA_CO GERMANY FR56 22
3 TAMARA_CO GERMANY FR56 Madrid
Dan_country_Value Dan_sex_Value Dan_Age_Ccy Dan_city_Ccy \
Group
1 GER M EUR EUR
2 FRA M
3 ESP DKN
Dan_country_Ccy Dan_sex_Ccy
Group
1 USD USD
2 CHF
3
From there, you can drop columns you don't want, change strings from "TAMARA_CO" to "TAMARA", "GERMANY" to "GER", use reset_index(drop=True), etc.
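A hedged sketch of that cleanup, where the exact renames and the dropped columns are assumptions taken from the asker's desired output:
df = (df.replace({'TAMARA_CO': 'TAMARA', 'GERMANY': 'GER'}) # shorten the general values
        .drop(columns=['Dan_country_Value', 'Dan_country_Ccy', 'Dan_sex_Ccy']) # columns not in the desired output
        .reset_index(drop=True))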
You can do this quite easily with only 3 steps:
Split your data frame into 2 parts: the "general data" (which we want as a series) and the more specific data. Each data frame now contains the same kinds of information.
The key part of your problem: reorganizing the data. All you need is the pandas pivot function. It does exactly what you need!
Add the general information and the pivoted data back together.
# Split Data
general = df[df.Group == "0"].set_index("Name")["Value"].copy()
main_df = df[df.Group != "0"]
# Pivot Data
result = main_df.pivot(index="Group", columns=["Name"],
values=["Value", "Ccy"]).fillna("")
result.columns = [f"{c[1]}_{c[0]}" for c in result.columns]
# Create a data frame that has an identical row for each group
general_df = pd.DataFrame([general]*3, index=result.index)
general_df.columns = [c + "_Value" for c in general_df.columns]
# Merge the data back together
result = general_df.merge(result, on="Group")
The result given above does not give the exact column order you want, so you'd have to specify that manually with
final_cols = ["ID_Value", "Country_Value", "IBAN_Value",
              "Dan_Age_Value", "Dan_Age_Ccy", "Dan_city_Value",
              "Dan_city_Ccy", "Dan_sex_Value"]
result = result[final_cols]
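One small robustness note: rather than hard-coding the *3 repeat when building general_df, the row count can be taken from the pivoted result, so the code keeps working for other numbers of groups:
general_df = pd.DataFrame([general] * len(result), index=result.index)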
I have a dataframe df with two columns as follows:
ID Country_pairs
0 X [(France, USA), (USA, France)]
1 Y [(USA, UK), (UK, France), (USA, France)]
I want to output all possible pairs of countries in two columns as follows:
ID Country1 Country2
X France USA
X USA France
Y USA UK
Y UK France
Y USA France
Doing this gives me the output I want:
result = pd.DataFrame()
for index, row in df.iterrows():
    x = row['Country_pairs']
    temp = pd.DataFrame(data=x, columns=['Country1', 'Country2'])
    temp['ID'] = row['ID']
    result = result.append(temp)
print(result)
The dataframe is over 200 million rows, so this is very slow since I'm looping. I was wondering if there is a faster solution?
import pandas as pd
# setup the dataframe
df = pd.DataFrame({'ID': ['X', 'Y'],
'Country_Pairs': [[('France', 'USA'), ('USA', 'France')],
[('USA', 'UK'), ('UK', 'France'), ('USA', 'France')]]})
ID Country_Pairs
0 X [(France, USA), (USA, France)]
1 Y [(USA, UK), (UK, France), (USA, France)]
# separate each tuple to its own row with explode
df2 = df.explode('Country_Pairs')
# separate each value in the tuple to its own column
df2[['Country1', 'Country2']] = pd.DataFrame(df2.Country_Pairs.tolist(), index=df2.index)
# delete Country_Pairs
df2.drop(columns=['Country_Pairs'], inplace=True)
ID Country1 Country2
0 X France USA
0 X USA France
1 Y USA UK
1 Y UK France
1 Y USA France
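A small follow-up, if a clean 0..n index like the desired output is wanted (explode keeps the original, now duplicated, index):
df2 = df2.reset_index(drop=True)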
Give this a try:
explode_df = df.apply(pd.Series.explode)
split_country = explode_df.apply(lambda x: ' '.join(x['Country_pairs']), axis=1).str.split(expand=True)
# whether you would like to combine the results
res = pd.concat([explode_df, split_country], axis=1)
You are looking for .explode()
result = df.explode('Country_pairs')
result["Country1"] = result.Country_pairs.apply(lambda t:t[0])
result["Country2"] = result.Country_pairs.apply(lambda t:t[1])
del result["Country_pairs"]
200 million rows is massive, so there is no point running this computation in pandas. As suggested in the comments, use Apache Spark, or if the data is in a database, you could possibly work something out there.
The solution I proffer works on small datasets suited to pandas: take the data out of pandas, use the itertools functions product and chain, and build the dataframe back. It should be reasonably fast, but certainly not for 200 million rows.
# using the data provided by @Trenton
df = pd.DataFrame({'ID': ['X', 'Y'],
'Country_Pairs': [[('France', 'USA'), ('USA', 'France')],
[('USA', 'UK'), ('UK', 'France'), ('USA', 'France')]]})
from itertools import product, chain
step1 = chain.from_iterable(product([first], last) # wrap first in a list so multi-character IDs stay whole
                            for first, last in df.to_numpy())
res = pd.DataFrame(((first, *last) for first, last in step1),
                   columns=['ID', 'Country1', 'Country2'])
res
res
ID Country1 Country2
0 X France USA
1 X USA France
2 Y USA UK
3 Y UK France
4 Y USA France
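For data at that scale arriving as a file, a rough out-of-core sketch with plain pandas is also possible; everything here is an assumption for illustration ('pairs.csv' as the input path, 'pairs_long.csv' as the output, and Country_pairs stored as a Python-literal string per row):
import ast
import pandas as pd

# process the file in 1M-row chunks and append each exploded chunk to the output
for i, chunk in enumerate(pd.read_csv('pairs.csv', chunksize=1_000_000)):
    chunk['Country_pairs'] = chunk['Country_pairs'].apply(ast.literal_eval) # parse the stringified pairs
    out = chunk.explode('Country_pairs')
    out[['Country1', 'Country2']] = pd.DataFrame(out['Country_pairs'].tolist(), index=out.index)
    out.drop(columns='Country_pairs').to_csv('pairs_long.csv', mode='a', header=(i == 0), index=False)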
I have a dataframe where I want to extract the part after a double space. For all rows in column NAME there is a double white space after the company name, before the integers.
NAME INVESTMENT PERCENT
0 APPLE COMPANY A 57 638 232 stocks OIL LTD 0.12322
1 BANANA 1 COMPANY B 12 946 201 stocks GOLD LTD 0.02768
2 ORANGE COMPANY C 8 354 229 stocks GAS LTD 0.01786
df = pd.DataFrame({
'NAME': ['APPLE COMPANY A  57 638 232 stocks', 'BANANA 1 COMPANY B  12 946 201 stocks', 'ORANGE COMPANY C  8 354 229 stocks'],
'PERCENT': [0.12322, 0.02768 , 0.01786]
})
I have this earlier, but it also includes integers in the company name:
df['STOCKS']=df['NAME'].str.findall(r'\b\d+\b').apply(lambda x: ''.join(x))
Instead I tried to extract after double spaces
df['NAME'].str.split('(\s{2})')
which gives output:
0 [APPLE COMPANY A, , 57 638 232 stocks]
1 [BANANA 1 COMPANY B, , 12 946 201 stocks]
2 [ORANGE COMPANY C, , 8 354 229 stocks]
However, I want the integers that occur after double spaces to be joined/merged and put into a new column.
NAME PERCENT STOCKS
0 APPLE COMPANY A 0.12322 57638232
1 BANANA 1 COMPANY B 0.02768 12946201
2 ORANGE COMPANY C 0.01786 8354229
How can I modify my second function to do what I want?
Following the original logic you may use
df['STOCKS'] = df['NAME'].str.extract(r'\s{2,}(\d+(?:\s\d+)*)', expand=False).str.replace(r'\s+', '', regex=True)
df['NAME'] = df['NAME'].str.replace(r'\s{2,}\d+(?:\s\d+)*\s+stocks', '', regex=True)
Output:
NAME PERCENT STOCKS
0 APPLE COMPANY A 0.12322 57638232
1 BANANA 1 COMPANY B 0.02768 12946201
2 ORANGE COMPANY C 0.01786 8354229
Details
\s{2,}(\d+(?:\s\d+)*) is used to extract the first occurrence of whitespace-separated consecutive digit chunks after 2 or more whitespaces, and .replace(r'\s+', '', regex=True) then removes any whitespace in the extracted text.
.replace(r'\s{2,}\d+(?:\s\d+)*\s+stocks', '', regex=True) updates the text in the NAME column: it removes 2 or more whitespaces, the consecutive whitespace-separated digit chunks, and then 1+ whitespaces followed by stocks. The final \s+stocks may be replaced with .* if other words can follow.
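To see the extraction regex in isolation, a plain re sketch on one sample string:
import re

s = 'BANANA 1 COMPANY B  12 946 201 stocks'
m = re.search(r'\s{2,}(\d+(?:\s\d+)*)', s)
print(m.group(1)) # 12 946 201
print(m.group(1).replace(' ', '')) # 12946201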
Another pandas approach, which will cast STOCKS to numeric type:
df_split = (df['NAME'].str.extractall(r'^(?P<NAME>.+)\s{2}(?P<STOCKS>[\d\s]+)')
            .reset_index(level=1, drop=True))
df_split['STOCKS'] = pd.to_numeric(df_split.STOCKS.str.replace(r'\D', '', regex=True))
Assign these columns back into your original DataFrame:
df[['NAME', 'STOCKS']] = df_split[['NAME', 'STOCKS']]
NAME STOCKS PERCENT
0 APPLE COMPANY A 57638232 0.12322
1 BANANA 1 COMPANY B 12946201 0.02768
2 ORANGE COMPANY C 8354229 0.01786
You can use look-behind and look-ahead operators (from the re module):
''.join(re.findall(r'(?<=\s{2})(.*)(?=stocks)', string)).replace(' ', '')
This catches all characters between the double space and the word stocks, and replaces the remaining spaces with the empty string.
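Applied over the whole column, a sketch that assumes the df built above:
import re

df['STOCKS'] = df['NAME'].apply(
    lambda s: ''.join(re.findall(r'(?<=\s{2})(.*)(?=stocks)', s)).replace(' ', ''))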
Another Solution using Split
df["NAME"].apply(lambda x:x[x.find(' ')+2:x.find('stocks')-1].replace(' ',''))
Reference: look-behind assertions in the Python re documentation.
You can try splitting on the double space:
df['STOCKS'] = df['NAME'].str.split('  ').str[1].str.replace(r'\s|stocks', '', regex=True)
df['NAME'] = df['NAME'].str.split('  ').str[0]
This can be done without using regex for the extraction by using split (the NAME cleanup below still uses one):
df['STOCKS'] = df['NAME'].apply(lambda x: ''.join(x.split('  ')[1].split(' ')[:-1]))
df['NAME'] = df['NAME'].str.replace(r'\s?\d+(?:\s\d+).*', '', regex=True)
I have a number of columns in a dataframe:
df = pd.DataFrame({'Date':[1990],'State Income of Alabama':[1],
'State Income of Washington':[2],
'State Income of Arizona':[3]})
All headers contain the same strings, with exactly one space between each word and before the state's name.
I want to strip out the string 'State Income of ' and leave the state name intact as the new header for each column, so they all read:
Alabama Washington Arizona
1 2 3
I've tried using the str.replace method on the columns, like:
df.columns = df.columns.str.replace('State Income of ', '')
But this isn't giving me the desired output.
Here is another solution, not in place:
df.rename(columns=lambda x: x.split()[-1])
or in place:
df.rename(columns=lambda x: x.split()[-1], inplace = True)
Your way works for me, but there are alternatives:
One way is to split your column names and take the last word:
df.columns = [i.split()[-1] for i in df.columns]
>>> df
Alabama Arizona Washington
0 1 3 2
You can use the re module for this:
>>> import pandas as pd
>>> df = pd.DataFrame({'State Income of Alabama':[1],
... 'State Income of Washington':[2],
... 'State Income of Arizona':[3]})
>>>
>>> import re
>>> df.columns = [re.sub('State Income of ', '', col) for col in df]
>>> df
Alabama Washington Arizona
0 1 2 3
re.sub('State Income of ', '', col) replaces any occurrence of 'State Income of ' with an empty string (with "nothing," effectively) in the string col.
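As one more hedged alternative on Python 3.9+, str.removeprefix avoids regex semantics entirely and only strips the leading occurrence:
df.columns = [c.removeprefix('State Income of ') for c in df.columns]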
I'm trying to remove spaces, apostrophes, and double quotes from the data in each column using this for loop:
for c in data.columns:
data[c] = data[c].str.strip().replace(',', '').replace('\'', '').replace('\"', '').strip()
but I keep getting this error:
AttributeError: 'Series' object has no attribute 'strip'
data is the dataframe and was obtained from an Excel file:
xl = pd.ExcelFile('test.xlsx')
data = xl.parse(sheetname='Sheet1')
Am I missing something? I added the str, but that didn't help. Is there a better way to do this?
I don't want to use the column labels, like data['column label'], because the text can be different. I would like to iterate over each column and remove the characters mentioned above.
incoming data:
id city country
1 Ontario Canada
2 Calgary ' Canada'
3 'Vancouver Canada
desired output:
id city country
1 Ontario Canada
2 Calgary Canada
3 Vancouver Canada
UPDATE: using your sample DF:
In [80]: df
Out[80]:
id city country
0 1 Ontario Canada
1 2 Calgary ' Canada'
2 3 'Vancouver Canada
In [81]: df.replace(r'[,\"\']','', regex=True).replace(r'\s*([^\s]+)\s*', r'\1', regex=True)
Out[81]:
id city country
0 1 Ontario Canada
1 2 Calgary Canada
2 3 Vancouver Canada
OLD answer:
you can use DataFrame.replace() method:
In [75]: df.to_dict('records')
Out[75]:
[{'a': ' x,y ', 'b': 'a"b"c', 'c': 'zzz'},
{'a': "x'y'z", 'b': 'zzz', 'c': ' ,s,,'}]
In [76]: df
Out[76]:
a b c
0 x,y a"b"c zzz
1 x'y'z zzz ,s,,
In [77]: df.replace(r'[,\"\']','', regex=True).replace(r'\s*([^\s]+)\s*', r'\1', regex=True)
Out[77]:
a b c
0 xy abc zzz
1 xyz zzz s
r'\1' is a backreference to the first numbered capturing RegEx group
data[c] does not return a single value; it returns a Series (a whole column of data), and a Series has no strip method of its own.
You can apply the strip operation to an entire column with df.apply, or more simply through the .str accessor, as shown below.
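A hedged rewrite of the original loop using the .str accessor (the astype(str) cast is an assumption to guard against non-string columns; note it will also stringify numeric columns such as id, so restrict the loop to text columns if that matters):
for c in data.columns:
    data[c] = (data[c].astype(str)
                      .str.strip() # remove leading/trailing whitespace
                      .str.replace(r"[,'\"]", '', regex=True)) # drop commas, apostrophes, double quotes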