How to read csv formatted numeric data into Pandas - python

I have a csv file with two formatted columns that currently read in as objects:
One column contains percentage values, which read in as strings like '0.01%'. The % is always at the end.
The other contains currency values, which read in as strings like '$1234.5'.
I have tried using the split function to remove the % or $ inside the dataframe, then calling float on the result of the split. This prints the correct result but will not assign the value. It also gives a TypeError that float does not have a split function, even though I do the split before the float.

Try this:
import pandas as pd
df = pd.read_csv('data.csv')
"""
The example df looks like this:
col1 col2
0 3.04% $100.25
1 0.15% $1250
2 0.22% $322
3 1.30% $956
4 0.49% $621
"""
df['col1'] = df['col1'].str.split('%', expand=True)[0]
df['col2'] = df['col2'].str.split('$', n=1, expand=True)[1]
df[['col1', 'col2']] = df[['col1', 'col2']].apply(pd.to_numeric)
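If you'd rather not use split at all, a vectorized alternative is to strip the symbols and convert in one pass. A minimal sketch, assuming the same col1/col2 layout as the example above:
import pandas as pd

df = pd.DataFrame({'col1': ['3.04%', '0.15%'], 'col2': ['$100.25', '$1250']})
# Drop the trailing '%' / leading '$', then let to_numeric pick the dtype
df['col1'] = pd.to_numeric(df['col1'].str.rstrip('%'))
df['col2'] = pd.to_numeric(df['col2'].str.lstrip('$'))
print(df.dtypes)  # both columns are now float64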

You are probably looking for the apply method.
With
df['first_col'] = df['first_col'].apply(lambda x: float(x.strip('%')))
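For completeness, a runnable version of that idea, as a sketch using the example columns from the answer above rather than the asker's real column names:
import pandas as pd

df = pd.DataFrame({'col1': ['3.04%', '0.15%'], 'col2': ['$100.25', '$1250']})
# strip() removes the symbol so float() can parse the remainder
df['col1'] = df['col1'].apply(lambda x: float(x.strip('%')))
df['col2'] = df['col2'].apply(lambda x: float(x.strip('$')))
print(df)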

Related

Clean way to convert string containing decimal to a string containing an int for a column in pandas?

I have a dataframe where one column contains numbers but as string values like "1.0", "52.0" etc.
I want to convert the column to instead contain strings like "PRE_1", "PRE_52".
Example
df = pd.DataFrame([['1.0'],['52.0']],columns=['Pre'])
df["pre"] = 'PRE_' + df["pre"].astype(str)
gives me output of PRE_1.0
I tried:
df["pre"] = 'PRE_' + df["pre"].astype(int).astype(str) but got a ValueError.
Do I need to convert it into something else before trying to convert it to an int?
It looks like df["Pre"].astype(float).astype(int).astype(str) might do what I want, but I'm open to cleaner ways of doing it.
I'm pretty new to pandas, so help would be greatly appreciated!
To be able to help properly, having sample data would be great. Based on the information you did provide: if the data coming in is a float, you can apply a format to truncate it, as below.
df = pd.DataFrame({'pre': [1.0, 52.0]})
df['pre'] = df['pre'].map('PRE_{:.0f}'.format)
print(df)
Apply a function:
import pandas as pd
df = pd.DataFrame([['1.0'],['52.0']],columns=['Pre'])
print(df)
df.Pre = df.Pre.apply(lambda n: f'PRE_{float(n):.0f}')
print(df)
Output:
Pre
0 1.0
1 52.0
Pre
0 PRE_1
1 PRE_52
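If the strings are guaranteed to look like '1.0', a string-only alternative that avoids the float round-trip is to split off the decimal part. A minimal sketch, assuming every value has that form:
import pandas as pd

df = pd.DataFrame([['1.0'], ['52.0']], columns=['Pre'])
# Keep the part before the decimal point, then prepend the prefix
df['Pre'] = 'PRE_' + df['Pre'].str.split('.').str[0]
print(df)  # PRE_1, PRE_52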

How to sort dataframe by value in Pandas?

I have a data set in csv that I read with pd.read_csv. I want to sort the existing data by descending value.
my code is this:
dataset = pd.read_csv('Kripto.csv')
sorted = dataset.sort_values(by = "Freq1", ascending=False)
x = dataset.iloc[:, :].values
and my data set (print(dataset)) is this:
Letter;Freq1
0 A;0.0817
1 B;0.0150
2 C;0.0278
3 D;0.0425
4 E;0.1270
When I want to use this code:
sorted = dataset.sort_values(by = "Freq1", ascending=False)
python gives me an error and says KeyError: 'Freq1'
I know that "Freq1" is not the name of the column, but I have no idea how to assign a name.
Your csv file has ';' as its separator; you need to indicate that in the read_csv call:
import pandas as pd
dataset = pd.read_csv('your.csv', sep=';')
And that's all you need to do
Your CSV file uses semicolons to separate values. Since pandas by default expects commas, use
dataset = pd.read_csv('Kripto.csv', sep=';')
instead.
You should also use the sorted dataset to get your values in sorted order, instead of dataset, since the latter will remain unsorted:
x = sorted.iloc[:, :].values
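Putting both fixes together, a minimal sketch (assuming the Kripto.csv layout shown in the question; the variable is renamed so it doesn't shadow the built-in sorted):
import pandas as pd

dataset = pd.read_csv('Kripto.csv', sep=';')  # ';' splits Letter and Freq1 into two columns
sorted_dataset = dataset.sort_values(by='Freq1', ascending=False)
x = sorted_dataset.iloc[:, :].values  # take values from the sorted frame
print(sorted_dataset)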

Wide to long returns empty output - Python dataframe

I have a dataframe which can be generated from the code as given below
df = pd.DataFrame({'person_id': [1, 2, 3],
                   'date1': ['12/31/2007', '11/25/2009', '10/06/2005'], 'val1': [2, 4, 6],
                   'date2': ['12/31/2017', '11/25/2019', '10/06/2015'], 'val2': [1, 3, 5],
                   'date3': ['12/31/2027', '11/25/2029', '10/06/2025'], 'val3': [7, 9, 11]})
I followed the below solution to convert it from wide to long
pd.wide_to_long(df, stubnames=['date', 'val'], i='person_id',
j='grp').sort_index(level=0)
Though this works with the sample data above, it doesn't work with my real data, which has more than 200 columns. Instead of person_id, my real data has subject_ID, which holds values like DC0001, DC0002 etc. Does i always have to be numeric? Instead of reshaping, it adds the stub values as new columns in my dataset and returns zero rows.
This is how my real columns look: [screenshot of column names omitted]
My real data might contain NAs as well. So do I have to fill them with default values for wide_to_long to work?
Can you please help with what the issue might be? Any other approach that achieves the same result is also welcome.
Try adding the additional suffix argument, which allows string suffixes:
pd.wide_to_long(.......................,suffix='\w+')
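To answer the side question: i does not have to be numeric; any column (or set of columns) that uniquely identifies the rows works, so a string subject_ID is fine. A runnable sketch of the corrected call, using the sample frame from the question:
import pandas as pd

df = pd.DataFrame({'person_id': [1, 2, 3],
                   'date1': ['12/31/2007', '11/25/2009', '10/06/2005'], 'val1': [2, 4, 6],
                   'date2': ['12/31/2017', '11/25/2019', '10/06/2015'], 'val2': [1, 3, 5]})
# suffix=r'\w+' accepts word-character suffixes after the stub, not just digits
long_df = pd.wide_to_long(df, stubnames=['date', 'val'], i='person_id',
                          j='grp', suffix=r'\w+').sort_index(level=0)
print(long_df)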
The issue is with your column names: the numbers used to convert from wide to long need to be at the end of your column names, or you need to specify a suffix to group by. I think the easiest solution is to create a function that accepts a regex and the dataframe.
import pandas as pd
import re

def change_names(df, regex):
    # Select one of the three column groups
    old_cols = df.filter(regex=regex).columns
    # Create list of new column names
    new_cols = []
    for col in old_cols:
        # Get the stub name of the original column
        stub = ''.join(re.split(r'\d', col))
        # Get the time point
        num = re.findall(r'\d+', col)  # returns a list like ['1']
        # Make the new column name
        new_col = stub + num[0]
        new_cols.append(new_col)
    # Create dictionary mapping old column names to new column names
    dd = {oc: nc for oc, nc in zip(old_cols, new_cols)}
    # Rename columns
    df.rename(columns=dd, inplace=True)
    return df

tdf = pd.DataFrame({'person_id': [1, 2, 3],
                    'h1date': ['12/31/2007', '11/25/2009', '10/06/2005'], 't1val': [2, 4, 6],
                    'h2date': ['12/31/2017', '11/25/2019', '10/06/2015'], 't2val': [1, 3, 5],
                    'h3date': ['12/31/2027', '11/25/2029', '10/06/2025'], 't3val': [7, 9, 11]})
# Change date columns, then val columns
tdf = change_names(tdf, 'date$')
tdf = change_names(tdf, 'val$')
print(tdf)
person_id hdate1 tval1 hdate2 tval2 hdate3 tval3
0 1 12/31/2007 2 12/31/2017 1 12/31/2027 7
1 2 11/25/2009 4 11/25/2019 3 11/25/2029 9
2 3 10/06/2005 6 10/06/2015 5 10/06/2025 11
This is quite late to answer this question, but putting the solution here in case someone else finds it useful.
tdf = pd.DataFrame({'person_id': [1, 2, 3],
                    'h1date': ['12/31/2007', '11/25/2009', '10/06/2005'], 't1val': [2, 4, 6],
                    'h2date': ['12/31/2017', '11/25/2019', '10/06/2015'], 't2val': [1, 3, 5],
                    'h3date': ['12/31/2027', '11/25/2029', '10/06/2025'], 't3val': [7, 9, 11]})
## You can use m13op22 solution to rename your columns with numeric part at the
## end of the column name. This is important.
tdf = tdf.rename(columns={'h1date': 'hdate1', 't1val': 'tval1',
'h2date': 'hdate2', 't2val': 'tval2',
'h3date': 'hdate3', 't3val': 'tval3'})
## Then use the non-numeric portion, (in this example 'hdate', 'tval') as
## stubnames. The mistake you were doing was using ['date', 'val'] as stubnames.
df = pd.wide_to_long(tdf, stubnames=['hdate', 'tval'], i='person_id', j='grp').sort_index(level=0)
print(df)

Round pandas dataframe numeric values in string type columns

I searched online posts, but what I found was all about how to round only the float columns in a mixed dataframe; my problem is how to round float values inside a string (object) type column.
Say my dataframe like this:
pd.DataFrame({'a':[1.1111,2.2222, 'aaaa'], 'b':['bbbb', 2.2222,3.3333], 'c':[3.3333,'cccc', 4.4444]})
Looking for an output like
pd.DataFrame({'a':[1.1,2.2, 'aaaa'], 'b':['bbbb', 2.2,3.3], 'c':[3.3,'cccc', 4.4]})
----Above is a straight question------
----Reason why I do so is below----
I have 3 csv files, each with a string header and float values, with different row and column counts.
I need to append the 3 into one dataframe, then export it as a new csv, with each part separated by an empty row.
My 3 dataframes and the desired combined output look like this:
[screenshots of dataframes One, Two and Three, and the combined result, omitted]
Please note that the output dataframe contains the headers from the 3 sub-dataframes.
So, when I import them, the first csv is of course imported with pd.read_csv, no issue.
Then I used .append(pd.Series([np.NaN])) to create an empty separator row.
Then the second csv is loaded and appended with pd.append(), but if I don't include header=None in read_csv(), the second one will not be mapped horizontally under the first one, because the csv files have uneven rows and columns.
So, two options:
Include header=None in read_csv(); then I can't simply use round(), as
df = df.round()
does not work, and I need to find a way to round only the numeric values in each column.
Also note that when header=None is included, all column types are object, per df.dtypes.
Don't include header=None in read_csv(); then I can round each dataframe, but I have trouble combining them horizontally with their headers.
Any suggestions?
csv example
import pandas as pd
import io
exp = io.StringIO("""
month;abc;cba;fef;sefe;yjy;gtht
100;0.45384534;0.43455;0.56385;0.5353;0.523453;0.53553
200;0.453453;0.453453;0.645396;0.76786;0.36327;0.453659
""")
df = pd.read_csv(exp, sep=";", header=None)
print(df.dtypes)
df = df.applymap(lambda x: round(x, 1) if isinstance(x, (int, float)) else x)
print(df)
There is a simple way to loop over every single element in a dataframe using applymap. Combined with isinstance, which tests for a specific type, you get the following.
df = pd.DataFrame({'a':[1.1111,2.2222, 'aaaa'], 'b':['bbbb', 2.2222,3.3333], 'c':[3.3333,'cccc', 4.4444]})
df.dtypes
a object
b object
c object
dtype: object
df2 = df.applymap(lambda x: round(x, 1) if isinstance(x, (int, float)) else x)
Obtaining the following dataframe:
a b c
0 1.1 bbbb 3.3
1 2.2 2.2 cccc
2 aaaa 3.3 4.4
With the following dtypes unchanged
df2.dtypes
a object
b object
c object
dtype: object
As for the other example in your question, I noticed that even the numbers are saved as strings. There is a method for converting strings to numbers, pd.to_numeric, which works on a Series.
From your exp, I get the following:
df = pd.read_csv(exp, sep=";", header=None)
df2 = df.apply(lambda x: pd.to_numeric(x, errors='ignore'), axis=1)
df3 = df2.applymap(lambda x: round(x, 1) if isinstance(x, (int, float)) else x)
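A side note: on newer pandas (2.1 and later), applymap is deprecated in favour of DataFrame.map. The same element-wise logic, as a sketch assuming pandas >= 2.1:
import pandas as pd

df = pd.DataFrame({'a': [1.1111, 2.2222, 'aaaa'],
                   'b': ['bbbb', 2.2222, 3.3333],
                   'c': [3.3333, 'cccc', 4.4444]})
# DataFrame.map applies the function element-wise, just like applymap did
df2 = df.map(lambda x: round(x, 1) if isinstance(x, (int, float)) else x)
print(df2)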

Why does referencing a concatenated pandas dataframe return multiple entries?

When I create a dataframe using concat like this:
import pandas as pd
dfa = pd.DataFrame({'a':[1],'b':[2]})
dfb = pd.DataFrame({'a':[3],'b':[4]})
dfc = pd.concat([dfa,dfb])
When I try to reference it like I would any other DataFrame, I get the following result:
>>> dfc['a'][0]
0 1
0 3
Name: a, dtype: int64
I would expect my concatenated DataFrame to behave like a normal DataFrame and return the integer that I want like this simple DataFrame does:
>>> dfa['a'][0]
1
I am just a beginner, is there a simple explanation for why the same call is returning an entire DataFrame and not the single entry that I want? Or, even better, an easy way to get my concatenated DataFrame to respond like a normal DataFrame when I try to reference it? Or should I be using something other than concat?
You've mistaken what the normal behavior is. dfc['a'][0] is a label lookup: it matches anything with an index value of 0, of which there are two, because you concatenated two dataframes whose indexes both include 0.
In order to select by position 0, use
dfc['a'].iloc[0]
or you could have constructed dfc like
dfc = pd.concat([dfa,dfb], ignore_index=True)
dfc['a'][0]
Both returning
1
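A third option to the same effect is to rebuild the index after concatenating. A minimal sketch:
import pandas as pd

dfa = pd.DataFrame({'a': [1], 'b': [2]})
dfb = pd.DataFrame({'a': [3], 'b': [4]})
# reset_index(drop=True) replaces the duplicated 0s with a fresh 0..n-1 index
dfc = pd.concat([dfa, dfb]).reset_index(drop=True)
print(dfc['a'][0])  # 1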
EDITED (thx piRSquared's comment)
Use append() instead of pd.concat() (note: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so this only works on older versions):
dfc = dfa.append(dfb, ignore_index=True)
dfc['a'][0]
1
