Pandas: split a column into columns of different sizes - python

--Edited-- [SOLVED]
I am using tabula to convert PDF invoices to a pandas DataFrame, but the last column doesn't come out right.
I want to split the last column, named 'PVF c/ IVA PVA s/Tx Desc% Tx Inf. IVA% P.Unit. Total Liq.'
I want to split it on each space into new columns ['PVFc/IVA', 'PVAs/Tx', 'Desc%', 'TxInf.', 'IVA%', 'P.Unit.', 'Total Liq.'], and the row values should be split the same way, e.g. row 2 becomes '7,41', '6,30', '65,0', '0,03', '6', '2,24', '22,40'.
I have searched and found how to split, but some rows split into 7 values and others into only 6, and I get an error.
For more information: every row where 'PVP c/Iva' is NaN or 'Esc.' is 'NETT' has no 'PVFc/IVA' value, so those rows split into only 6 values. For my analysis it is fine to insert '0,00' as a prefix in those rows so that all rows have 7 values.
Any solution is welcome; I am just starting with Python and pandas. Thanks for your time.
I applied parts of the code from #Ahmed Sayed's answer below and made progress.
To concatenate the NaN column with the others, I first replace NaN with an empty string:
dataframe['placeHolderColumn'] = dataframe['placeHolderColumn'].fillna(value='')
After some trial and error I found that some cells contain more than one consecutive space, so I collapsed runs of spaces into a single space and then replaced each space with '*':
dataframe['newColumn'] = dataframe['newColumn'].str.replace(' ', '*')
Then I created a new column to verify the number of split elements:
dataframe['count2'] = dataframe['newColumn'].str.count(r'\*', re.I)
The count confirmed that every row now has six '*' separators, i.e. seven values.
So, as the last step, I apply the split method:
dataframe[['c1','c2','c3','c4','c5','c6']] = dataframe['newColumn'].str.split('*', expand=True)
but I get an error (ValueError: Columns must be same length as key).
--FOUND--
I have to pass another column name: I was passing only 6 new columns when there are 7 values.
dataframe[['c1','c2','c3','c4','c5','c6','c7']] = dataframe['newColumn'].str.split('*', expand=True)
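Putting the whole pipeline together, here is a minimal runnable sketch of the approach (the long column name is from the invoice; the sample values are invented here for illustration):

import re
import pandas as pd

col = 'PVF c/ IVA PVA s/Tx Desc% Tx Inf. IVA% P.Unit. Total Liq.'
dataframe = pd.DataFrame({col: ['7,41 6,30 65,0 0,03 6 2,24 22,40', '6,30 65,0 0,03 6 2,24 22,40']})

# prefix short rows (6 values) with a dummy '0,00' so every row has 7 values
dataframe['placeHolderColumn'] = ''
dataframe.loc[dataframe[col].str.count(' ') <= 5, 'placeHolderColumn'] = '0,00 '
dataframe['newColumn'] = dataframe['placeHolderColumn'] + dataframe[col]

# collapse runs of spaces, mark the split points with '*', then split into 7 columns
dataframe['newColumn'] = dataframe['newColumn'].str.replace(r' +', ' ', regex=True).str.replace(' ', '*')
dataframe[['c1','c2','c3','c4','c5','c6','c7']] = dataframe['newColumn'].str.split('*', expand=True)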

So the problem here is that the cells do not have an equal number of values in that column. We can address this by counting the number of values and, wherever a value is missing, adding a dummy 00 at the beginning so the string is easier to split later.
First, let's create a column with the number of spaces; this tells us how many values each row has.
import re
df["count"]= df['PVF c/ IVA PVA s/Tx Desc% Tx Inf. IVA% P.Unit. Total Liq.'].str.count(' ', re.I)
Then, if the count is less than what we expect, prepend a zero marker to each short cell:
# here we compare the number of spaces to 5; 5 identifies the short cells that need a dummy 00 at the beginning
df.loc[df["count"] <= 5, 'placeHolderColumn'] = '00 ' # notice there is a space after the zeros
# rows that did not match keep NaN in placeHolderColumn, which would turn the
# concatenation below into NaN, so fill them with an empty string first
df['placeHolderColumn'] = df['placeHolderColumn'].fillna('')
# now let's create a new column and merge the placeHolderColumn column to the old values column
df['newColumn'] = df['placeHolderColumn'] + df['PVF c/ IVA PVA s/Tx Desc% Tx Inf. IVA% P.Unit. Total Liq.'].astype(str)
Lastly, we can split the column (seven values means seven target columns):
df[['c1','c2','c3','c4','c5','c6','c7']] = df['newColumn'].str.split(' ', expand=True)
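As a side note, str.split(' ', expand=True) on its own already pads ragged rows, but only on the right; a sketch of why that is not enough here, assuming the same df as above:

# expand=True pads short rows with None in the LAST columns,
# but in this data the missing field ('PVF c/ IVA') is the FIRST one,
# which is why the dummy '00 ' prefix above is needed
parts = df['newColumn'].str.split(' ', expand=True)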

Related

How to strip values from columns

I have this dataset with a column named 'Discount' where the values are given as '20% off', '25% off', etc.
What I want is to keep just the number in the column and remove the '%' symbol and the 'off' string.
I'm using this to do it:
df['discount'] = df['discount'].apply(lambda x: x.lstrip('%').rstrip('off'))
However, when I apply it, all the values in the 'discount' column become NaN.
I also tried this:
df['discount'] = df['discount'].str.replace('off' , '')
However, it does the same thing.
Is there any other way of handling this? I just want all the values in that column to be just the number, like 25, 20, 10, and to get rid of the percentage sign and the string.
Try this:
d['discount'] = d['discount'].str.replace(r'(%|\s*off)', '', regex=True).astype(int)
Output:
>>> df
discount
0 20
1 25
I came up with this solution:
d['discount'] = d['discount'].str.split('%').str[0]
or as int:
d['discount'] = d['discount'].str.split('%').str[0].astype(int)
We chop each string in two pieces at the %-sign and then take the first part, the number. Note the .str accessor: plain .split() is a Python string method and does not exist on a Series.
If you have a fixed '% off' suffix, the most efficient approach is to just remove the last 5 characters:
d['discount'] = d['discount'].str[:-5].astype(int)
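For completeness, a self-contained sketch (sample data invented here) using str.extract, which pulls out the leading digits regardless of what follows them:

import pandas as pd

df = pd.DataFrame({'discount': ['20% off', '25% off', '10% off']})
# grab the first run of digits and convert to int;
# this works whether the suffix is '% off', '%off', or anything else
df['discount'] = df['discount'].str.extract(r'(\d+)', expand=False).astype(int)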

How to deal with Pandas dataframe column with list containing string values, get unique words

I am trying to do some basic operations on a dataframe column (called dimensions) that contains a list. Do basic operations like df['dimensions'].str.replace() work when the dataframe column contains a list? It did not work for me. I also tried to replace the text in the column using the re.sub() method, and it did not work either.
This is the last column in my dataframe:
**dimensions**
[50' long]
None
[70ft long, 19ft wide, 8ft thick]
[5' high, 30' long, 18' wide]
This is what I have tried, but it did not work:
def dimension_unique_words(dimensions):
    if dimensions != 'None':
        for value in dimensions:
            new_value = re.sub(r'[^\w\s]|ft|feet', ' ', value)
            new_value = ''.join([i for i in new_value if not i.isdigit()])
        return new_value

df['new_col'] = df['dimensions'].apply(dimension_unique_words)
this is the output I got from my code:
**new_col**
NaN
None
NaN
None
NaN
None
What I want to do is to replace the numbers and the units [ft, feet, '] in the column called dimensions with a space and then apply the df.unique() on that column to get the unique values which are [long, wide, thick, high].
The expected output would be:
**new_col**
[long]
None
[long, wide, thick]
[high, long, wide]
...then I want to apply the df.unique() on the new_col to get [long, wide, thick, high]
How to do that?
First we deal with the annoyance that your 'dimensions' column is sometimes None, sometimes a list of one string element. So extract that element when it's non-null:
df['dimensions2'] = df['dimensions'].apply(lambda col: col[0] if col else None)
Next, get all alphabetic strings in each row, excluding measurements:
>>> df['dimensions2'].str.findall(r'\b([a-zA-Z]+)')
0 [long]
1 None
2 [long, wide, thick]
3 [high, long, wide]
Note we use a \b word-boundary (to exclude the 'ft' from '30ft'), and to avoid \b being interpreted as a backspace escape we use an r'' raw string for the regex.
This gives you a list. You wanted a set, to prevent duplicates occurring, so:
df['dimensions2'].str.findall(r'\b([a-zA-Z]+)').apply(lambda l: set(l) if l else None)
0 {long}
1 None
2 {thick, long, wide}
3 {high, long, wide}
use str.findall to collect all dimension words in each row into a list.
use explode to expand each list into one element per row, keeping the original index.
then use groupby(level=0).unique() to drop duplicates per index, giving a list per row.
df['new_col'] = (
    df['dimensions'].fillna('').astype(str)
    .str.findall(r'\b[a-zA-Z]+\b')
    .explode().dropna()
    .groupby(level=0).unique()
)
use df['new_col'].explode().dropna().unique() to get the unique dimensions values.
array(['long', 'wide', 'thick', 'high'], dtype=object)
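Putting this answer together as a runnable sketch (the sample data is reconstructed from the question, with each cell as a one-element list):

import pandas as pd

df = pd.DataFrame({'dimensions': [["50' long"], None,
                                  ['70ft long, 19ft wide, 8ft thick'],
                                  ["5' high, 30' long, 18' wide"]]})
df['new_col'] = (
    df['dimensions'].fillna('').astype(str)
    .str.findall(r'\b[a-zA-Z]+\b')   # keep alphabetic words only; \b skips the 'ft' glued to digits
    .explode().dropna()              # one word per row, original index preserved
    .groupby(level=0).unique()       # unique words per original row
)
print(df['new_col'].explode().dropna().unique())
# expected: ['long' 'wide' 'thick' 'high']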

How to ignore `NaN` values in a pandas dataframe while using `rjust()`?

I have a pandas dataframe (df) with a column ('ISSN'). Most of the values in that column are strings of 8 characters (e.g. "12345678"). Some are shorter (e.g. "983750") and I would like to left-pad them with zeros to reach exactly 8 characters (in the previous example, obtaining "00983750").
I am using rjust as follows and it works as expected:
df['ISSN'] = df['ISSN'].apply(lambda x: str(x).rjust(8, '0'))
But since some of the values of that column are NaN, they get modified as well and I get 00000nan. How can I apply rjust() just to non-NaN values?
Use Pandas' .str.zfill, which handles NaN for you:
# sample data
import numpy as np
import pandas as pd
df = pd.DataFrame({"ISSN": [np.nan, '1234', '12345678']})
df['ISSN'] = df['ISSN'].str.zfill(8)
Output:
ISSN
0 NaN
1 00001234
2 12345678
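If you would rather keep rjust, a minimal sketch that applies it only to the non-NaN rows via a boolean mask (same df as above):

# select only the rows where ISSN is present and pad just those
mask = df['ISSN'].notna()
df.loc[mask, 'ISSN'] = df.loc[mask, 'ISSN'].astype(str).str.rjust(8, '0')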

How to count characters across strings in a pandas column

I have a dataframe with the following structure:
prod_sec
A
AA
AAAAAAAAAAB
AAAABCCCAA
AACC
ABCCCBAC
df = pd.DataFrame({'prod_sec': ['A','AA','AAAAAAAAAAB','AAAABCCCAA','AACC','ABCCCBAC']})
Each string is a sequence made up of letters (A to C in this example).
I would like to create a list for each letter that counts its occurrences at each position, down the entire pandas column.
For example, in the first string A is only in the first position and not in the other locations.
In the second string the A is in the first two positions and not in the other locations.
In the third string the A is in every position except the last. Etc... I want a total count for the column, by position. Here is an example for A:
A -> [1,0,0,0,0,0,0,0,0,0,0]
AA -> [1,1,0,0,0,0,0,0,0,0,0]
AAAAAAAAAAB -> [1,1,1,1,1,1,1,1,1,1,0]
AAAABCCCAA -> [1,1,1,1,0,0,0,0,0,0,1]
AACC -> [1,1,0,0,0,0,0,0,0,0,0]
ABCCCBAC -> [1,0,0,0,0,0,1,0,0,0,0]
so for A, I would want an output similar to the following: A -> [6,4,2,2,1,1,2,1,1,1,0]
In the end, I'm trying to get a matrix with a row for each character.
[6,4,2,2,1,1,2,1,1,1,0]
[0,1,0,0,1,1,0,0,0,0,1]
[0,0,1,1,0,1,2,0,0,0,0]
The following should work. You can adjust the result, depending on your exact needs (numpy array, data frame, dictionary, etc). Tell me if you need more help with that.
max_length = max([len(i) for i in df.prod_sec])
d = {'A': [0]*max_length, 'B': [0]*max_length, 'C': [0]*max_length}
for i in df.prod_sec:
    for k in range(len(i)):
        d[i[k]][k] += 1
result = pd.DataFrame.from_dict(d, orient='index')
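As an alternative, a vectorized sketch under the same assumptions (letters A to C, the df from the question), which avoids the explicit Python loops:

import pandas as pd

df = pd.DataFrame({'prod_sec': ['A','AA','AAAAAAAAAAB','AAAABCCCAA','AACC','ABCCCBAC']})
# one column per character position; shorter strings are padded with NaN
chars = df['prod_sec'].apply(lambda s: pd.Series(list(s)))
# compare against each letter and sum down the rows to count occurrences per position
result = pd.DataFrame({letter: (chars == letter).sum() for letter in 'ABC'}).T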

In Python, how do I select the columns of a dataframe satisfying a condition on the number of NaN?

I hope someone can help me. I'm new to Python, and I have a dataframe with 111 columns and over 40 000 rows. All the columns contain NaN values (some columns contain more NaNs than others), so I want to drop those columns having at least 80% NaN values. How can I do this?
To solve my problem, I tried the following code
df1=df.apply(lambda x : x.isnull().sum()/len(x) < 0.8, axis=0)
The expression x.isnull().sum()/len(x) divides the number of NaNs in column x by the length of x, and the part < 0.8 selects those columns containing less than 80% NaN.
The problem is that when I run this code I only get the names of the columns together with the boolean "True" but I want the entire columns, not just the names. What should I do?
You could do this:
filt = df.isnull().sum()/len(df) < 0.8
df1 = df.loc[:, filt]
You want to achieve two things. First, you have to find all columns which contain at least 80% NaNs. Second, you want to discard them from your DataFrame.
To get a pandas Series indicating, for each column, whether it should be kept, you can do:
df1 = df.isnull().sum(axis=0) < 0.8*df.shape[0]
This gives True for all columns to keep, as .isnull() gives True (or 1) if an element is NaN and False (or 0) for a valid number. Then .sum(axis=0) sums down the rows, giving the number of NaNs in each column. The comparison then checks whether that number is less than 80% of the number of rows.
For the second task, you can use this Series to index your columns:
df = df[df.columns[df1]]
or as suggested in the comments by doing:
df.drop(df.columns[df1==False], axis=1, inplace=True)
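For reference, pandas' dropna can do the same in one call via its thresh parameter (the minimum number of non-NaN values a column must have to survive); the exact boundary may differ by one row from the mask approach:

# keep columns with more than 20% non-NaN values, i.e. less than 80% NaN
df1 = df.dropna(axis=1, thresh=int(0.2 * len(df)) + 1)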
