Formatting the data which increases monotonically in Python - python

I have formatted the data as needed, but my final DataFrame is no longer monotonically increasing, even though the input data increases monotonically in the first column (freq). Here is the link for Data_input_truncated.txt. My Python code is below:
import pandas as pd
#create DataFrame from csv with columns f and v
df = pd.read_csv('Data_input.txt', sep=r"\s+", names=['freq','v'])
#boolean mask for identify columns of new df
m = df['v'].str.endswith(')')
#new column by replace NaNs by forward filling
df['g'] = df['v'].where(m).ffill()
#get original ordering for new columns
cols = df['g'].unique()
#remove rows with same values in v and g columns
df = df[df['v'] != df['g']]
#reshape by pivoting with change ordering of columns by reindex
df = df.pivot(index='freq', columns='g', values='v').rename_axis(None, axis=1).reindex(columns=cols).reset_index()
df.columns = [x.replace('(','').replace(')','').replace(',',':') for x in df.columns]
df.to_csv('target.txt', index=False, sep='\t')
Now the created target.txt is not monotonic. Here is the link for target.txt. How can I make it monotonic before saving as a file?
I am using Spyder 3.2.6 (Anaconda) where python 3.6.4 64-bit is embedded.

The problem is that your freq data is of type str, not float, so pivoting reorders the rows alphabetically instead of numerically. One option is to convert the freq column to float; then, if the scientific-notation formatting matters, you can set the float_format parameter in to_csv:
### same code before
#remove rows with same values in v and g columns
df = df[df['v'] != df['g']]
# convert to float
df['freq']= df['freq'].astype(float)
#reshape by pivoting with change ordering of columns by reindex
df = df.pivot(index='freq', columns='g', values='v').rename_axis(None, axis=1).reindex(columns=cols).reset_index()
df.columns = [x.replace('(','').replace(')','').replace(',',':') for x in df.columns]
df.to_csv('target.txt', index=False, sep='\t', float_format='%.17E' ) # add float_format='%.17E'
Note that float_format='%.17E' means scientific notation with 17 digits after the decimal point, as in your input, but you can change this number to whatever you want if those digits are not important.
EDIT: I get this result in target.txt (first 5 rows and 3 columns)
freq R1:1 R1:2
0.00000000000000000E+00 4.07868642871600962E0 3.12094533520232087E-13
1.00000000000000000E+06 4.43516799439728793E0 4.58503433913467795E-3
2.00000000000000000E+06 4.54224931058591253E0 1.21517855438593236E-2
3.00000000000000000E+06 4.63952376349496909E0 2.10017318391844077E-2
4.00000000000000000E+06 4.74002677709486608E0 3.05258806632440871E-2
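To see the ordering effect in isolation, here is a minimal sketch. The four frequency strings and the single R1:1 column are made-up sample data, not taken from Data_input.txt. It pivots the same rows once with freq as strings and once with freq as floats:

```python
import pandas as pd

# Hypothetical frequencies written in scientific notation, stored as strings.
raw = pd.DataFrame({
    "freq": ["0.0E+00", "1.0E+06", "2.0E+06", "1.0E+07"],
    "g": ["R1:1"] * 4,
    "v": ["a", "b", "c", "d"],
})

# Pivoting on the string column sorts the index lexicographically.
by_string = raw.pivot(index="freq", columns="g", values="v")

# Converting to float first makes the pivot sort numerically.
by_float = raw.assign(freq=raw["freq"].astype(float)).pivot(
    index="freq", columns="g", values="v"
)

string_order = list(by_string.index)
float_order = list(by_float.index)
print(string_order)  # '1.0E+07' lands before '2.0E+06'
print(float_order)   # increases monotonically
```

With strings, '1.0E+07' sorts before '2.0E+06' because the comparison is character by character, which is why the converted float index is the one that stays monotonic.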

Related

Assigning numeric values to column in Python

I'm a beginner so sorry in advance if I'm unclear :)
I have a .csv with 2 columns, doc_number and text. However, sometimes the rows start with the doc number (as they should) and sometimes it just starts with text from the previous row. All input is the type 'object'. There are also many empty rows between the inputs.
How can I make sure doc_number consists of all the numeric values (doc_numbers are random numbers of 8 digits) and text is the text, and remove the empty rows?
Example of what it looks like:
69353029, Hello. How are you doing?
What are you going to do tomorrow?
59302058, Tomorrow I have to go to work.
58330394, It's going to rain tomorrow
45801923, Yesterday it was sunny.
Next week it will also be sunny.
68403942, Thank you.
What it should look like:
_doc, _text
69353029, Hello. How are you doing? What are you going to do tomorrow?
59302058, Tomorrow I have to go to work.
58330394, It's going to rain tomorrow.
45801923, Yesterday it was sunny. Next week it will also be sunny.
68403942, Thank you.
Here's what I can think of when looking at the dataset. It looks like a CSV file which translates to an unnamed column (representing _doc) with mixed integers and strings and another unnamed column with strings (representing _text)
Since, it has to be csv, I created the following dummy.csv file.
I would turn the CSV into a dataframe and add the initial column titles mixed and text as follows:
df = pd.read_csv('dummy.csv', header= None)
df.columns = ['mixed', 'text']
df
which results in the data frame df for cleaning:
What we can do is separate the mixed column into two different columns, m_numeric and m_text. This can be done by creating the m_numeric column with pd.to_numeric(..., errors='coerce'), which turns non-numeric strings into NaN, and then copying those non-numeric strings into the m_text column as follows:
df['m_numeric'] = pd.to_numeric(df['mixed'], errors='coerce')
mask = df['m_numeric'].isna()
df.loc[mask, 'm_text'] = df.loc[mask, 'mixed']
We can now fill the NaN values in m_numeric with a forward fill (ffill), which propagates the last valid observation forward until the next valid value. Similarly, we can fill the NaN values in the original text column with the strings from m_text as follows:
df['m_numeric'] = df['m_numeric'].ffill()
df['m_numeric'] = df['m_numeric'].astype(int)
df['text'] = df['text'].fillna(df['m_text'])
Now we have the final columns we need as text and m_numeric on which we can apply the pandas groupby() function. We groupby the m_numeric column and use .apply to join the strings in two rows separated by a space. Finally we can rename the column names to _doc and _text as follows:
df = df.groupby('m_numeric', sort= False)['text'].apply(' '.join).reset_index()
df = df.rename(columns= {'m_numeric' : '_doc',
'text' : '_text'})
Result:
Complete Code:
import pandas as pd
df = pd.read_csv('dummy.csv', header= None)
df.columns = ['mixed', 'text']
#separate column
df['m_numeric'] = pd.to_numeric(df['mixed'], errors='coerce')
mask = df['m_numeric'].isna()
df.loc[mask, 'm_text'] = df.loc[mask, 'mixed']
#replace nan values
df['m_numeric'] = df['m_numeric'].ffill()
df['m_numeric'] = df['m_numeric'].astype(int)
df['text'] = df['text'].fillna(df['m_text'])
#group by column
df = df.groupby('m_numeric', sort= False)['text'].apply(' '.join).reset_index()
df = df.rename(columns= {'m_numeric' : '_doc',
'text' : '_text'})
df
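Since dummy.csv is not shown, here is a self-contained sketch of the same idea that reads a small in-memory sample instead. The sample text is shortened from the question, and the use of io.StringIO and skipinitialspace=True are my assumptions for the sake of a runnable example:

```python
import io
import pandas as pd

# Shortened sample based on the question; blank lines are skipped by read_csv.
sample = (
    "69353029, Hello. How are you doing?\n"
    "What are you going to do tomorrow?\n"
    "\n"
    "59302058, Tomorrow I have to go to work.\n"
    "45801923, Yesterday it was sunny.\n"
    "Next week it will also be sunny.\n"
)

df = pd.read_csv(io.StringIO(sample), header=None,
                 names=["mixed", "text"], skipinitialspace=True)

# Rows that start with text have no number, so they become NaN here.
doc = pd.to_numeric(df["mixed"], errors="coerce")

# For those rows the text sits in the first column; move it over.
df["text"] = df["text"].fillna(df["mixed"])

# Carry the last seen document number forward onto continuation rows.
df["_doc"] = doc.ffill().astype(int)

out = (df.groupby("_doc", sort=False)["text"]
         .agg(" ".join)
         .reset_index()
         .rename(columns={"text": "_text"}))
print(out)
```

This only works if the text itself contains no commas; otherwise read_csv would split a row into more than two fields.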
I'm also just learning pandas, so this may not be the best solution:
import pandas as pd
import numpy as np
#reading .csv and naming the Headers, - you can choose your own namings here
df = pd.read_csv("text.csv", header=None, names = ["_doc","_text"])
# updating empty _text cells with values from _doc
df["_text"] = np.where(df['_text'].isnull(),df["_doc"],df["_text"])
# change dtype to int (it will generate <NA> in strings) and back to str to aggregate later
df["_doc"] = df["_doc"].apply(pd.to_numeric, errors="coerce").astype("Int64").astype(str)
# aggregating rows if below value is <NA> and joining strings in col _text
df = df.groupby((df["_doc"].ne("<NA>")).cumsum()).agg({"_doc":"first","_text":" ".join}).reset_index(drop=True)
# converting back to int (if needed)
df["_doc"] = df["_doc"].astype(int)
print(df)
out:
_doc _text
0 69353029 Hello. How are you doing? What are you going ...
1 59302058 Tomorrow I have to go to work.
2 58330394 It's going to rain tomorrow
3 45801923 Yesterday it was sunny. Next week it will als...
4 68403942 Thank you.
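The grouping key in that groupby may be the least obvious part, so here is a small sketch of just that step. The Series below is a made-up stand-in for the _doc column after the conversion to str:

```python
import pandas as pd

# Stand-in for the _doc column: "<NA>" marks rows that continue the previous text.
doc = pd.Series(["69353029", "<NA>", "59302058", "45801923", "<NA>"])

# True on every row that starts a new document; the running sum of those
# flags gives each document and its continuation rows the same group id.
group_id = doc.ne("<NA>").cumsum()
print(group_id.tolist())  # [1, 1, 2, 3, 3]
```

Rows sharing a group id are then collapsed by agg, taking the first _doc and joining the _text pieces.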

Populating a column based off of values in another column

Hi I am working with pandas to manipulate some lab data. I currently have a data frame with 5 columns.
The first three columns(Analyte,CAS NO(1), and Value) are in the correct order.
The last two columns(CAS NO 2 and Value 2) are not.
Is there a way to align CAS NO(2) and Value(2) with the first three columns based on matching CAS numbers (i.e. CAS NO(2) = CAS NO(1))?
I am new to python and pandas. Thank you for your help
You can reorder the columns by reassigning df to a selection of itself, indexed with a list of the column names in the order you want.
colidx = ['Analyte', 'CAS NO(1)', 'CAS NO(2)']
df = df[colidx]
It is better to provide input data in text format so we can copy-paste it. I understand your question like this: you need to sort the last two columns together, so that CAS NO(2) matches CAS NO(1).
Since CAS NO(2) = CAS NO(1), you then do not need a duplicated CAS NO(2) column, right?
Split off the last two columns, index them by CAS NO(2), convert the result to a dict, and use that dict to map the new values.
# Column names are assumed here; adjust them to match your dataframe.
# Split off the last 2 columns and index them by CAS number.
df_tmp = df[['CAS NO(2)', 'Value(2)']]
df_tmp = df_tmp.set_index('CAS NO(2)')
# Keep only the first 3 columns of the original dataframe (copy to avoid SettingWithCopyWarning)
df = df[['Analyte', 'CASNo(1)', 'Value(1)']].copy()
# Now copy CASNo(1) to CAS NO(2)
df['CAS NO(2)'] = df['CASNo(1)']
# Now create the Value(2) column on the original dataframe
df['Value(2)'] = df['CASNo(1)'].map(df_tmp.to_dict()['Value(2)'])
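Here is a runnable sketch of the mapping idea on a tiny made-up table. The CAS numbers are borrowed from elsewhere on this page, while the values and the column names are assumptions for illustration:

```python
import pandas as pd

# Made-up table: the last two columns are in a different row order
# than the first two.
df = pd.DataFrame({
    "Analyte": ["Benzene", "Ethylbenzene", "Methyl-tert-butyl ether"],
    "CASNo(1)": ["71-43-2", "100-41-4", "1634-04-4"],
    "CAS NO(2)": ["100-41-4", "1634-04-4", "71-43-2"],
    "Value(2)": [18, 5, 7],
})

# Lookup from CAS number to value, built from the misaligned columns.
lookup = df.set_index("CAS NO(2)")["Value(2)"]

# Map each CASNo(1) to the value stored under the same CAS number.
aligned = df[["Analyte", "CASNo(1)"]].copy()
aligned["Value(2)"] = aligned["CASNo(1)"].map(lookup)
print(aligned)
```

CAS numbers with no match in the lookup simply come out as NaN.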
Try the following:
import pandas as pd
import numpy as np
#create an example of your table
list_CASNo1 = ['71-43-2', '100-41-4', np.nan, '1634-04-4']
list_Val1 = [np.nan]*len(list_CASNo1)
list_CASNo2 = [np.nan, np.nan, np.nan, '100-41-4']
list_Val2 = [np.nan, np.nan, np.nan, '18']
df = pd.DataFrame(zip(list_CASNo1, list_Val1, list_CASNo2, list_Val2), columns =['CASNo(1)','Value(1)','CAS NO(2)','Value(2)'], index = ['Benzene','Ethylbenzene','Gasoline Range Organics','Methyl-tert-butyl ether'])
#split the data to two dataframes
df1 = df[['CASNo(1)','Value(1)']]
df2 = df[['CAS NO(2)','Value(2)']]
#merge df2 to df1 based on the specified columns
#reset_index and set_index will take care
#that df_adjusted will have the same index names as df1
df_adjusted = df1.reset_index().merge(df2.dropna(),
how = 'left',
left_on = 'CASNo(1)',
right_on = 'CAS NO(2)').set_index('index')
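But be careful with duplicate CAS numbers in the key columns: they do not raise an error, but each duplicated key produces extra rows in the merged result, so df_adjusted can end up longer than df1.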
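A small sketch of what duplicated keys do to a left merge, using made-up frames. The validate argument of merge is one way to have pandas complain instead of silently adding rows:

```python
import pandas as pd

left = pd.DataFrame({"CASNo(1)": ["71-43-2", "100-41-4"]})
# Made-up right-hand side where the same CAS number appears twice.
right = pd.DataFrame({"CAS NO(2)": ["100-41-4", "100-41-4"],
                      "Value(2)": ["18", "20"]})

# The duplicated key matches twice, so the result has 3 rows, not 2.
merged = left.merge(right, how="left",
                    left_on="CASNo(1)", right_on="CAS NO(2)")
print(len(merged))

# validate="many_to_one" requires the right-hand keys to be unique.
try:
    left.merge(right, how="left", left_on="CASNo(1)",
               right_on="CAS NO(2)", validate="many_to_one")
    duplicates_detected = False
except pd.errors.MergeError:
    duplicates_detected = True
print(duplicates_detected)
```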

Pandas Removing Leading Zeros

I have a short script to pivot data. The first column is a 9 digit ID number, often beginning with zeros such as 000123456
Here is the script:
df = pd.read_csv('source')
new_df = df.pivot_table(index = 'id', columns = df.groupby('id').cumcount().add(1), values = ['prog_id', 'prog_type'], aggfunc='first').sort_index(axis=1,level=1)
new_df.columns = [f'{x}_{y}' for x,y in new_df.columns]
new_df.to_csv('destination')
print(new_df)
Although the CSV is being read with an id such as 000123456, the output only contains 123456
Even when setting an explicit dtype, Pandas removes the leading zeros. Is there a work around for telling Pandas to leave the leading zeros?
Per the comment on the original post, set the dtype to string:
df = pd.read_csv('source', dtype={'id': str})
You could use the pandas str.zfill() method right after reading your CSV file "source". It left-pads the values of the "id" column with zeros up to a given width, in this case 9 characters (3 zeros + 6 original digits for an ID like 123456). So, we would have:
df = pd.read_csv('source')
df['id'] = df['id'].astype(str).str.zfill(9)
# (...)
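Both suggestions can be checked on a tiny in-memory CSV. The two IDs and the prog_id values below are made up, and io.StringIO stands in for the real 'source' file:

```python
import io
import pandas as pd

# Made-up CSV content with 9-digit IDs that start with zeros.
csv_text = "id,prog_id\n000123456,7\n012345678,9\n"

# Default parsing infers integers, which drops the leading zeros.
as_default = pd.read_csv(io.StringIO(csv_text))

# Reading the column as str keeps the text exactly as written.
as_string = pd.read_csv(io.StringIO(csv_text), dtype={"id": str})

# If the zeros are already gone, pad back to a fixed width of 9.
restored = as_default["id"].astype(str).str.zfill(9)

print(as_default["id"].tolist())
print(as_string["id"].tolist())
print(restored.tolist())
```

Reading as str is the safer of the two, since zfill only works when every ID is known to have the same length.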

Wide to long returns empty output - Python dataframe

I have a dataframe which can be generated from the code as given below
df = pd.DataFrame({'person_id' :[1,2,3],'date1':
['12/31/2007','11/25/2009','10/06/2005'],'val1':
[2,4,6],'date2': ['12/31/2017','11/25/2019','10/06/2015'],'val2':[1,3,5],'date3':
['12/31/2027','11/25/2029','10/06/2025'],'val3':[7,9,11]})
I followed the below solution to convert it from wide to long
pd.wide_to_long(df, stubnames=['date', 'val'], i='person_id',
j='grp').sort_index(level=0)
Though this works with sample data as shown below, it doesn't work with my real data which has more than 200 columns. Instead of person_id, my real data has subject_ID which is values like DC0001,DC0002 etc. Does "I" always have to be numeric? Instead it adds the stub values as new columns in my dataset and has zero rows
This is how my real columns looks like
My real data might contains NA's as well. So do I have to fill them with default values for wide_to_long to work?
Can you please help as to what can be the issue? Or any other approach to achieve the same result is also helpful.
Try adding the suffix argument, which lets the part after the stub name be any string rather than only digits.
pd.wide_to_long(......................., suffix=r'\w+')
The issue is with your column names: the numbers used to convert from wide to long need to be at the end of the column names, or you need to specify a suffix pattern to match on. I think the easiest solution is to create a function that accepts a regex and the dataframe and renames the columns.
import pandas as pd
import re
def change_names(df, regex):
    # Select one of three column groups
    old_cols = df.filter(regex = regex).columns
    # Create list of new column names
    new_cols = []
    for col in old_cols:
        # Get the stubname of the original column
        stub = ''.join(re.split(r'\d', col))
        # Get the time point
        num = re.findall(r'\d+', col) # returns a list like ['1']
        # Make new column name
        new_col = stub + num[0]
        new_cols.append(new_col)
    # Create dictionary mapping old column names to new column names
    dd = {oc: nc for oc, nc in zip(old_cols, new_cols)}
    # Rename columns
    df.rename(columns = dd, inplace = True)
    return df
tdf = pd.DataFrame({'person_id' :[1,2,3],'h1date': ['12/31/2007','11/25/2009','10/06/2005'],'t1val': [2,4,6],'h2date': ['12/31/2017','11/25/2019','10/06/2015'],'t2val':[1,3,5],'h3date': ['12/31/2027','11/25/2029','10/06/2025'],'t3val':[7,9,11]})
# Change date columns
tdf = change_names(tdf, 'date$')
tdf = change_names(tdf, 'val$')
print(tdf)
person_id hdate1 tval1 hdate2 tval2 hdate3 tval3
0 1 12/31/2007 2 12/31/2017 1 12/31/2027 7
1 2 11/25/2009 4 11/25/2019 3 11/25/2029 9
2 3 10/06/2005 6 10/06/2015 5 10/06/2025 11
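As a possible shorter variant of the same renaming (my own sketch, not part of the answer above), a single regex substitution passed to rename can move the digits to the end. It assumes each affected name has the shape letters, digits, letters:

```python
import re
import pandas as pd

# Only the column names matter here, so an empty frame is enough.
tdf = pd.DataFrame(columns=["person_id", "h1date", "t1val", "h2date", "t2val"])

# Swap the digit group and the trailing letters: 'h1date' -> 'hdate1'.
# Names without digits (like 'person_id') do not match and stay unchanged.
renamed = tdf.rename(
    columns=lambda c: re.sub(r"^(\D+)(\d+)(\D+)$", r"\1\3\2", c)
)
print(list(renamed.columns))
```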
This is quite a late answer, but I am putting the solution here in case someone else finds it useful.
tdf = pd.DataFrame({'person_id' :[1,2,3],'h1date': ['12/31/2007','11/25/2009','10/06/2005'],'t1val': [2,4,6],'h2date': ['12/31/2017','11/25/2019','10/06/2015'],'t2val':[1,3,5],'h3date': ['12/31/2027','11/25/2029','10/06/2025'],'t3val':[7,9,11]})
## You can use m13op22 solution to rename your columns with numeric part at the
## end of the column name. This is important.
tdf = tdf.rename(columns={'h1date': 'hdate1', 't1val': 'tval1',
'h2date': 'hdate2', 't2val': 'tval2',
'h3date': 'hdate3', 't3val': 'tval3'})
## Then use the non-numeric portion, (in this example 'hdate', 'tval') as
## stubnames. The mistake you were doing was using ['date', 'val'] as stubnames.
df = pd.wide_to_long(tdf, stubnames=['hdate', 'tval'], i='person_id', j='grp').sort_index(level=0)
print(df)
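On the question of whether i has to be numeric: as far as I can tell it does not. Here is a minimal sketch with string IDs like the ones described in the question. The frame itself is made up, and only two time points are used to keep it short:

```python
import pandas as pd

# Made-up wide frame with string subject IDs instead of integers.
wide = pd.DataFrame({
    "subject_ID": ["DC0001", "DC0002"],
    "date1": ["12/31/2007", "11/25/2009"],
    "val1": [2, 4],
    "date2": ["12/31/2017", "11/25/2019"],
    "val2": [1, 3],
})

# The stub names must be the part of each column name before the number.
long_df = pd.wide_to_long(
    wide, stubnames=["date", "val"], i="subject_ID", j="grp"
).sort_index()
print(long_df)
```

So an empty result is more likely caused by stub names that do not match the start of the column names than by the type of the ID column.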

Pandas Re-indexing command

*RE: "Add missing dates to pandas dataframe", a previously asked question
import pandas as pd
import numpy as np
idx = pd.date_range('09-01-2013', '09-30-2013')
df = pd.DataFrame(data = [2,10,5,1], index = ["09-02-2013","09-03-2013","09-06-2013","09-07-2013"], columns = ["Events"])
df.index = pd.DatetimeIndex(df.index); #question (1)
df = df.reindex(idx, fill_value=np.nan)
print(df)
In the above script what does the command noted as question one do? If you leave this
command out of the script, the df will be re-indexed but the data portion of the
original df will not be retained. As there is no reference to the df data in the
DatetimeIndex command, why is the data from the starting df lost?
Short answer: df.index = pd.DatetimeIndex(df.index); converts the string index of df to a DatetimeIndex.
You have to make the distinction between different types of indexes. In
df = pd.DataFrame(data = [2,10,5,1], index = ["09-02-2013","09-03-2013","09-06-2013","09-07-2013"], columns = ["Events"])
you have an index containing strings. When using
df.index = pd.DatetimeIndex(df.index);
you convert this standard index with strings to an index with datetimes (a DatetimeIndex). So the values of these two types of indexes are completely different.
Now, you reindex with
idx = pd.date_range('09-01-2013', '09-30-2013')
df = df.reindex(idx)
where idx is also an index with datetimes. When you reindex the original df with a string index, there are no matching index values, so no column values of the original df are retained. When you reindex the second df (after converting the index to a datetime index), there will be matching index values, so the column values at those indices are retained.
See also http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.reindex.html
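The difference can be checked side by side with a shortened version of the script. The five-day range and the two data rows are cut down from the question to keep the output small:

```python
import pandas as pd

idx = pd.date_range("09-01-2013", "09-05-2013")
df = pd.DataFrame({"Events": [2, 10]},
                  index=["09-02-2013", "09-03-2013"])

# String labels never equal the Timestamp labels in idx, so nothing matches.
lost = df.reindex(idx)

# After converting the index, the dates line up and the values are kept.
converted = df.copy()
converted.index = pd.DatetimeIndex(converted.index)
kept = converted.reindex(idx)

print(lost["Events"].isna().all())   # every row is NaN
print(kept["Events"].notna().sum())  # the two original rows survive
```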
