Pandas Removing Leading Zeros - python

I have a short script to pivot data. The first column is a 9-digit ID number, often beginning with zeros, such as 000123456.
Here is the script:
df = pd.read_csv('source')
new_df = df.pivot_table(index = 'id', columns = df.groupby('id').cumcount().add(1), values = ['prog_id', 'prog_type'], aggfunc='first').sort_index(axis=1,level=1)
new_df.columns = [f'{x}_{y}' for x,y in new_df.columns]
new_df.to_csv('destination')
print(new_df)
Although the CSV is being read with an id such as 000123456, the output only contains 123456.
Even when setting an explicit dtype, Pandas removes the leading zeros. Is there a workaround for telling Pandas to leave the leading zeros?

Per the comment on the original post, set the dtype as string:
df = pd.read_csv('source', dtype={'id': str})
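For example, a minimal sketch (file name and column header taken from the question):
import pandas as pd

df = pd.read_csv('source', dtype={'id': str})
print(df['id'].head())  # values keep their leading zeros, e.g. 000123456
Because the id column stays a string through pivot_table and to_csv, the leading zeros survive into the output file.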

You could use pandas' str.zfill() method right after reading your CSV file 'source'. Basically, you left-pad the values of your 'id' column with as many zeros as you would like, in this particular case making the number 9 digits long (3 zeros + 6 original digits). So we would have:
df = pd.read_csv('source')
df['id'] = df['id'].astype(str).str.zfill(9)
# (...)
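For illustration, str.zfill pads on the left with zeros up to the requested width (a throwaway example, not part of the original answer):
import pandas as pd

s = pd.Series(['123456', '9876'])
print(s.str.zfill(9).tolist())  # ['000123456', '000009876']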


Assigning numeric values to column in Python

I'm a beginner so sorry in advance if I'm unclear :)
I have a .csv with 2 columns, doc_number and text. However, sometimes the rows start with the doc number (as they should) and sometimes they just start with text that belongs to the previous row. All input is of type 'object'. There are also many empty rows between the inputs.
How can I make sure doc_number contains all the numeric values (doc_numbers are random 8-digit numbers) and text contains the text, and remove the empty rows?
Example of what it looks like:
69353029, Hello. How are you doing?
What are you going to do tomorrow?
59302058, Tomorrow I have to go to work.
58330394, It's going to rain tomorrow
45801923, Yesterday it was sunny.
Next week it will also be sunny.
68403942, Thank you.
What it should look like:
_doc, _text
69353029, Hello. How are you doing? What are you going to do tomorrow?
59302058, Tomorrow I have to go to work.
58330394, It's going to rain tomorrow.
45801923, Yesterday it was sunny. Next week it will also be sunny.
68403942, Thank you.
Here's what I can think of when looking at the dataset. It looks like a CSV file, which translates to an unnamed column with mixed integers and strings (representing _doc) and another unnamed column with strings (representing _text).
Since it has to be a CSV, I created the following dummy.csv file.
I would turn the CSV into a dataframe and add the initial column titles mixed and text as follows:
df = pd.read_csv('dummy.csv', header= None)
df.columns = ['mixed', 'text']
df
which results in the data frame df for cleaning:
What we can do is separate the mixed column into two different columns, m_numeric and m_text. This can be done by creating the m_numeric column with pd.to_numeric(..., errors='coerce') and masking the non-numeric strings into the m_text column, as follows:
df['m_numeric'] = pd.to_numeric(df['mixed'], errors='coerce')
mask = df['m_numeric'].isna()
df.loc[mask, 'm_text'] = df.loc[mask, 'mixed']
We can now fill the NaN values in m_numeric using ffill (via fillna()), which propagates the last valid observation forward. Similarly, we can fill the NaN values in the original text column with the strings from m_text, as follows:
df['m_numeric'] = df['m_numeric'].fillna(method='ffill')
df['m_numeric'] = df['m_numeric'].astype(int)
df['text'] = df['text'].fillna(df['m_text'])
Now we have the final columns we need, text and m_numeric, on which we can apply the pandas groupby() function. We group by the m_numeric column and use .apply to join the strings of the grouped rows with a space. Finally, we rename the columns to _doc and _text as follows:
df = df.groupby('m_numeric', sort= False)['text'].apply(' '.join).reset_index()
df = df.rename(columns={'m_numeric': '_doc',
                        'text': '_text'})
Result:
Complete Code:
import pandas as pd
df = pd.read_csv('dummy.csv', header= None)
df.columns = ['mixed', 'text']
#separate column
df['m_numeric'] = pd.to_numeric(df['mixed'], errors='coerce')
mask = df['m_numeric'].isna()
df.loc[mask, 'm_text'] = df.loc[mask, 'mixed']
#replace nan values
df['m_numeric'] = df['m_numeric'].fillna(method='ffill')
df['m_numeric'] = df['m_numeric'].astype(int)
df['text'] = df['text'].fillna(df['m_text'])
#group by column
df = df.groupby('m_numeric', sort= False)['text'].apply(' '.join).reset_index()
df = df.rename(columns={'m_numeric': '_doc',
                        'text': '_text'})
df
I'm also just learning pandas, so maybe not the best solution:
import pandas as pd
import numpy as np
# reading the .csv and naming the headers - you can choose your own names here
df = pd.read_csv("text.csv", header=None, names = ["_doc","_text"])
# updating _text column empty cells with values from _doc
df["_text"] = np.where(df['_text'].isnull(),df["_doc"],df["_text"])
# change dtype to int (it will generate <NA> in strings) and back to str to aggregate later
df["_doc"] = df["_doc"].apply(pd.to_numeric, errors="coerce").astype("Int64").astype(str)
# aggregating rows if below value is <NA> and joining strings in col _text
df = df.groupby((df["_doc"].ne("<NA>")).cumsum()).agg({"_doc":"first","_text":" ".join}).reset_index(drop=True)
# converting back to int (if needed)
df["_doc"] = df["_doc"].astype(int)
print(df)
out:
_doc _text
0 69353029 Hello. How are you doing? What are you going ...
1 59302058 Tomorrow I have to go to work.
2 58330394 It's going to rain tomorrow
3 45801923 Yesterday it was sunny. Next week it will als...
4 68403942 Thank you.

Drop decimals and add commas Pandas

I want a column in my Dataframe to have no decimals but have commas. It's for a bar chart. Every time I add the commas I get the decimals. Even if I convert the column to integers first. Here is the DataFrame and what I tried that is not working!
df = pd.read_csv('https://github.com/ngpsu22/indigenous-peoples-day/raw/main/native_medians_means')
summary.med_resources_per_person.astype(int)
summary["med_resources_per_person"] = (summary["med_resources_per_person"].apply(lambda x : "
{:,}".format(x)))
You're not actually changing the dtype to int inside of the dataframe. You'll need to assign it back to the column:
df = pd.read_csv('https://github.com/ngpsu22/indigenous-peoples-day/raw/main/native_medians_means')
df["med_resources_per_person"] = df["med_resources_per_person"].astype(int)
df["med_resources_per_person"].apply(lambda x : "{:,}".format(x))
Or a little bit more concise:
df = pd.read_csv('https://github.com/ngpsu22/indigenous-peoples-day/raw/main/native_medians_means')
df["med_resources_per_person"] = df["med_resources_per_person"].astype(int).apply("{:,}".format)

Pandas not converting certain columns of dataframe to datetimeindex

My dataframe so far has columns 0 to 188 in the format yyyy-mm, and I am trying to convert cols, the list of those columns ( cols = list(hdata.columns[range(0,188)]) ), to a DatetimeIndex. There are a few other columns as well which are string names and can't be converted to datetime, so I tried doing this:
hdata[cols].columns = pd.to_datetime(hdata[cols].columns)  # convert columns to DatetimeIndex
But this is not working.
Can you please figure out what is wrong here?
Edit:
A better way to work on this type of data is to use the Split-Apply-Combine method.
Step 1: Split off the data on which you want to perform the specific operation.
nonReqdf = hdata.iloc[:,188:].sort_index()
reqdf = hdata.drop(['CountyName','Metro','RegionID','SizeRank'],axis=1)
Step 2: Do the operations. In my case that was converting the dataframe columns with year and month to a DatetimeIndex and resampling it quarterly.
reqdf.columns = pd.to_datetime(reqdf.columns)
reqdf = reqdf.resample('Q',axis=1).mean()
reqdf = reqdf.rename(columns=lambda x: str(x.to_period('Q')).lower()).sort_index() # renaming so that the string is yyyyq<1/2/3/4>, like 2012q1 or 2012q2
Step 3: Combine the two split dataframes (merge can also be used, depending on what you want).
reqdf = pd.concat([reqdf,nonReqdf],axis=1)
In order to modify some of the labels from an Index (be it for rows or columns), you need to use df.rename as in
for i in range(188):
    df.rename({df.columns[i]: pd.to_datetime(df.columns[i])},
              axis=1, inplace=True)
Or you can avoid looping by building a full sized index to cover all the columns with
df.columns = (
    pd.to_datetime(cols)                    # pass the list with strings to get a partial DatetimeIndex
    .append(df.columns.difference(cols))    # complete the index with the rest of the columns
)
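For illustration, yyyy-mm strings parse to the first day of the month (a throwaway check, not part of the answer):
import pandas as pd

print(pd.to_datetime(['2012-01', '2012-02']))
# DatetimeIndex(['2012-01-01', '2012-02-01'], dtype='datetime64[ns]', freq=None)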

Wide to long returns empty output - Python dataframe

I have a dataframe which can be generated from the code as given below
df = pd.DataFrame({'person_id': [1, 2, 3],
                   'date1': ['12/31/2007', '11/25/2009', '10/06/2005'], 'val1': [2, 4, 6],
                   'date2': ['12/31/2017', '11/25/2019', '10/06/2015'], 'val2': [1, 3, 5],
                   'date3': ['12/31/2027', '11/25/2029', '10/06/2025'], 'val3': [7, 9, 11]})
I followed the below solution to convert it from wide to long
pd.wide_to_long(df, stubnames=['date', 'val'], i='person_id', j='grp').sort_index(level=0)
Though this works with the sample data, it doesn't work with my real data, which has more than 200 columns. Instead of person_id, my real data has subject_ID, with values like DC0001, DC0002, etc. Does i always have to be numeric? Instead, it adds the stub values as new columns in my dataset and returns zero rows.
This is what my real columns look like.
My real data might contain NAs as well, so do I have to fill them with default values for wide_to_long to work?
Can you please help as to what can be the issue? Or any other approach to achieve the same result is also helpful.
Try adding an additional argument to the function which allows string suffixes.
pd.wide_to_long(.......................,suffix='\w+')
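For reference, a sketch of what that full call might look like on the sample frame above (suffix=r'\w+' only matters when the part after the stub is non-numeric; for the real data the id column would be subject_ID instead of person_id):
long_df = pd.wide_to_long(df, stubnames=['date', 'val'], i='person_id',
                          j='grp', suffix=r'\w+').sort_index(level=0)
print(long_df)
If the numbers sit in the middle of the column names (like h1date), the suffix alone won't help; the columns have to be renamed first, as the next answer does.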
The issue is with your column names: the numbers used to convert from wide to long need to be at the end of your column names, or you need to specify a suffix to group by. I think the easiest solution is to create a function that accepts a regex and the dataframe.
import pandas as pd
import re
def change_names(df, regex):
    # Select one of three column groups
    old_cols = df.filter(regex=regex).columns
    # Create list of new column names
    new_cols = []
    for col in old_cols:
        # Get the stubname of the original column
        stub = ''.join(re.split(r'\d', col))
        # Get the time point
        num = re.findall(r'\d+', col)  # returns a list like ['1']
        # Make new column name
        new_col = stub + num[0]
        new_cols.append(new_col)
    # Create dictionary mapping old column names to new column names
    dd = {oc: nc for oc, nc in zip(old_cols, new_cols)}
    # Rename columns
    df.rename(columns=dd, inplace=True)
    return df
tdf = pd.DataFrame({'person_id' :[1,2,3],'h1date': ['12/31/2007','11/25/2009','10/06/2005'],'t1val': [2,4,6],'h2date': ['12/31/2017','11/25/2019','10/06/2015'],'t2val':[1,3,5],'h3date': ['12/31/2027','11/25/2029','10/06/2025'],'t3val':[7,9,11]})
# Change date columns
tdf = change_names(tdf, 'date$')
tdf = change_names(tdf, 'val$')
print(tdf)
   person_id      hdate1  tval1      hdate2  tval2      hdate3  tval3
0          1  12/31/2007      2  12/31/2017      1  12/31/2027      7
1          2  11/25/2009      4  11/25/2019      3  11/25/2029      9
2          3  10/06/2005      6  10/06/2015      5  10/06/2025     11
This is quite late to answer this question, but I'm putting the solution here in case someone else finds it useful.
tdf = pd.DataFrame({'person_id' :[1,2,3],'h1date': ['12/31/2007','11/25/2009','10/06/2005'],'t1val': [2,4,6],'h2date': ['12/31/2017','11/25/2019','10/06/2015'],'t2val':[1,3,5],'h3date': ['12/31/2027','11/25/2029','10/06/2025'],'t3val':[7,9,11]})
## You can use m13op22's solution to rename your columns with the numeric part at the
## end of the column name. This is important.
tdf = tdf.rename(columns={'h1date': 'hdate1', 't1val': 'tval1',
                          'h2date': 'hdate2', 't2val': 'tval2',
                          'h3date': 'hdate3', 't3val': 'tval3'})
## Then use the non-numeric portion (in this example 'hdate', 'tval') as
## stubnames. The mistake you were making was using ['date', 'val'] as stubnames.
df = pd.wide_to_long(tdf, stubnames=['hdate', 'tval'], i='person_id', j='grp').sort_index(level=0)
print(df)

Formatting the data which increases monotonically in Python

I have formatted the data according to my needs, but now the final dataframe is not monotonically increasing, whereas the input data increases monotonically according to the first column (freq). Here is the link for Data_input_truncated.txt. My Python code is below:
import pandas as pd
#create DataFrame from csv with columns f and v
df = pd.read_csv('Data_input.txt', sep="\s+", names=['freq','v'])
#boolean mask for identify columns of new df
m = df['v'].str.endswith(')')
#new column by replace NaNs by forward filling
df['g'] = df['v'].where(m).ffill()
#get original ordering for new columns
cols = df['g'].unique()
#remove rows with same values in v and g columns
df = df[df['v'] != df['g']]
#reshape by pivoting with change ordering of columns by reindex
df = df.pivot('freq', 'g', 'v').rename_axis(None, axis=1).reindex(columns=cols).reset_index()
df.columns = [x.replace('(','').replace(')','').replace(',',':') for x in df.columns]
df.to_csv('target.txt', index=False, sep='\t')
Now the created target.txt is not monotonic. Here is the link for target.txt. How can I make it monotonic before saving as a file?
I am using Spyder 3.2.6 (Anaconda) where python 3.6.4 64-bit is embedded.
The problem is that your data is str and not float, and while pivoting it is reordered alphabetically. One option could be to change the type of the freq column to float, and then, if the scientific-number formatting is important, set the float_format parameter in to_csv:
### same code before
#remove rows with same values in v and g columns
df = df[df['v'] != df['g']]
# convert to float
df['freq']= df['freq'].astype(float)
#reshape by pivoting with change ordering of columns by reindex
df = df.pivot('freq', 'g', 'v').rename_axis(None, axis=1).reindex(columns=cols).reset_index()
df.columns = [x.replace('(','').replace(')','').replace(',',':') for x in df.columns]
df.to_csv('target.txt', index=False, sep='\t', float_format='%.17E' ) # add float_format='%.17E'
Note that float_format='%.17E' means scientific notation with 17 digits after the decimal point, as in your input, but you can change this number to any value you want if the digits are not important.
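As a quick illustration of that format string (a throwaway example, not part of the answer):
print('%.17E' % 2000000.0)  # 2.00000000000000000E+06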
EDIT: I get this result in target.txt (first 5 rows and 3 columns)
freq R1:1 R1:2
0.00000000000000000E+00 4.07868642871600962E0 3.12094533520232087E-13
1.00000000000000000E+06 4.43516799439728793E0 4.58503433913467795E-3
2.00000000000000000E+06 4.54224931058591253E0 1.21517855438593236E-2
3.00000000000000000E+06 4.63952376349496909E0 2.10017318391844077E-2
4.00000000000000000E+06 4.74002677709486608E0 3.05258806632440871E-2
