read_csv multi indexing dataframe - python

I have the following csv:
value value value value ...
id 1 1 1 2
indic 1 2 3 1
valuedate
05/01/1970 1.0 2.0 3.2 5.2
06/01/1970 4.1 ...
07/01/1970
08/01/1970
that I want to read into a pandas DataFrame, so I do the following:
df=pd.read_csv("mycsv.csv", skipinitialspace=True, tupleize_cols=True)
but get the following error:
IndexError: Too many levels: Index has only 1 level, not 2
I suspect the problem is with the multi-indexing, but I don't understand how to use the parameters of read_csv to solve this.
(NB: valuedate is the name of the index column)
I want to get this data into a multi-indexed DataFrame: several indic sub-columns under each id column.

file.csv:
value value value value
id 1 1 1 2
indic 1 2 3 1
valuedate
05/01/1970 1.0 2.0 3.2 5.2
Do:
import pandas as pd
df = pd.read_csv("file.csv", index_col=0, delim_whitespace=True)
print(df)
Output:
value value.1 value.2 value.3
id 1.0 1.0 1.0 2.0
indic 1.0 2.0 3.0 1.0
valuedate NaN NaN NaN NaN
05/01/1970 1.0 2.0 3.2 5.2
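If you actually want the id/indic rows to become a column MultiIndex rather than data rows, one hedged approach (a sketch, assuming the exact layout of file.csv shown above) is to read the two header rows and the data separately, then rebuild the columns by hand:
import pandas as pd
# Rows 1-2 hold the "id" and "indic" levels; row 0 ("value value ...") and
# the lone "valuedate" line (row 3) carry no column information we need.
levels = pd.read_csv("file.csv", delim_whitespace=True, header=None,
                     skiprows=1, nrows=2, index_col=0)
# Data rows start after the "valuedate" line.
data = pd.read_csv("file.csv", delim_whitespace=True, header=None,
                   skiprows=4, index_col=0)
data.columns = pd.MultiIndex.from_arrays(levels.values.tolist(),
                                         names=list(levels.index))
data.index.name = "valuedate"
print(data)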

Related

Create new column showing the occurrences of a column value in a range of others

I have a simple pandas DataFrame where I need to add a new column that counts, for each row, how many values in a range of other columns ('pricemonths') match the 'Current_Price' column:
import pandas as pd
import numpy as np
# my data
data = {'Item':['Bananas', 'Apples', 'Pears', 'Avocados','Grapes','Melons'],
'Jan':[1,0.5,1.1,0.6,2,4],
'Feb':[0.9,0.5,1,0.6,2,5],
'Mar':[1,0.6,1,0.6,2.1,6],
'Apr':[1,0.6,1,0.6,2,5],
'May':[1,0.5,1.1,0.6,2,5],
'Current_Price':[1,0.6,1,0.6,2,4]
}
# import my data
df = pd.DataFrame(data)
pricemonths=['Jan','Feb','Mar','Apr','May']
Thus, my final dataframe would contain another column ('times_found') with the values:
'times_found'
4
2
3
5
4
1
One way is to transpose the price columns of df, then use eq to compare them with "Current_Price" across the index (which creates a boolean DataFrame with True for matching prices and False otherwise), and sum across rows:
df['times_found'] = df['Current_Price'].eq(df.loc[:,'Jan':'May'].T).sum(axis=0)
or use numpy broadcasting:
df['times_found'] = (df.loc[:,'Jan':'May'].to_numpy() == df[['Current_Price']].to_numpy()).sum(axis=1)
Excellent suggestion from @HenryEcker: DataFrame.eq along an axis may be faster than transposing for larger DataFrames:
df['times_found'] = df.loc[:, 'Jan':'May'].eq(df['Current_Price'], axis=0).sum(axis=1)
Output:
Item Jan Feb Mar Apr May Current_Price times_found
0 Bananas 1.0 0.9 1.0 1.0 1.0 1.0 4
1 Apples 0.5 0.5 0.6 0.6 0.5 0.6 2
2 Pears 1.1 1.0 1.0 1.0 1.1 1.0 3
3 Avocados 0.6 0.6 0.6 0.6 0.6 0.6 5
4 Grapes 2.0 2.0 2.1 2.0 2.0 2.0 4
5 Melons 4.0 5.0 6.0 5.0 5.0 4.0 1
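For completeness, a row-wise apply expresses the same comparison more literally (a sketch; typically slower than the vectorized versions above on large frames):
# For each row, compare the Jan..May prices with that row's Current_Price
# and count how many of them match.
df['times_found'] = df.apply(
    lambda row: row.loc['Jan':'May'].eq(row['Current_Price']).sum(), axis=1
)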

changing index of dataframe: getting attribute error

So I am working in Python trying to change the index of my dataframe.
Here is my code:
df = pd.read_csv("data_file.csv", na_values=' ')
table = df['HINCP'].groupby(df['HHT'])
print(table.describe()[['mean', 'std', 'count', 'min', 'max']].sort_values('mean', ascending=False))
Here is the dataframe currently:
mean std count min max
HHT
1.0 106790.565562 100888.917804 25495.0 -5100.0 1425000.0
5.0 79659.567376 74734.380152 1410.0 0.0 625000.0
7.0 69055.725901 63871.751863 1193.0 0.0 645000.0
2.0 64023.122122 59398.970193 1998.0 0.0 610000.0
3.0 49638.428821 48004.399101 5718.0 -5100.0 609000.0
4.0 48545.356298 60659.516163 5835.0 -5100.0 681000.0
6.0 37282.245015 44385.091076 8024.0 -11200.0 676000.0
I want the index values to be like this instead of the numbered 1,2,...,7:
Married couple household
Nonfamily household:Male
Nonfamily household:Female
Other family household:Male
Other family household:Female
Nonfamily household:Male
Nonfamily household:Female
I tried calling set_index() on table, setting the keys to the list of index labels above that I want, but this gives me this error:
AttributeError: 'SeriesGroupBy' object has no attribute 'set_index'
I was also wondering if there is any way to alter the HHT label at the top of the index, or will that come with changing the index values?
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame(columns = ["HHT", "HINC"], data = np.transpose([[2,3,2,2,2,3,3,3,4], [1,1,3,1,4,7,8,9,11]]))
>>> df
HHT HINC
0 2 1
1 3 1
2 2 3
3 2 1
4 2 4
5 3 7
6 3 8
7 3 9
8 4 11
>>> table = df['HINC'].groupby(df['HHT'])
>>> td = table.describe()
>>> df2 = pd.DataFrame(td)
>>> df2.index = ['lab1', 'lab2', 'lab3']
>>> df2
count mean std min 25% 50% 75% max
lab1 4.0 2.25 1.500000 1.0 1.0 2.0 3.25 4.0
lab2 4.0 6.25 3.593976 1.0 5.5 7.5 8.25 9.0
lab3 1.0 11.00 NaN 11.0 11.0 11.0 11.00 11.0
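The question also asks about altering the HHT label at the top of the index. That label is the index name, and assigning a plain list (as above) clears it, so set the name you want afterwards, e.g. with rename_axis. A minimal sketch continuing the toy example (the new heading string is just an example):
>>> df2 = df2.rename_axis('Household type')
>>> df2.index.name
'Household type'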

how to read a column with another column name in pandas?

I have to read and process a set of files (e.g. 100 files). One file comes with the column name 'Idass'; the other files come with the column name 'IdassId'.
After processing, I select a few columns and write the output to Excel.
df.to_excel(writer, columns=['Date','IdassId','TankNo','GradeNo','Sales'],sheet_name='sales')
Here I miss that single file's entries, since it doesn't have a column named 'IdassId'; it has that specific column named 'Idass' instead.
(I could not rename that column before processing, since it is an automated process coming from another process.)
I tried renaming that column to IdassId and writing to Excel:
d = {'Idass': 'IdassId'}
df.rename(columns=d).to_excel(writer, columns=['Date','IdassId','TankNo','GradeNo','Sales'],sheet_name='sales')
but the above gives an error, since the other files already come with the column name 'IdassId':
ValueError: cannot reindex from a duplicate axis
How to do this in pandas?
I'm assuming you're concatenating the Excel files together, so it would look similar to below.
Idass IdassId
0 0.0 NaN
1 1.0 NaN
2 2.0 NaN
3 3.0 NaN
4 4.0 NaN
5 NaN 0.0
6 NaN 1.0
7 NaN 2.0
8 NaN 3.0
9 NaN 4.0
If you were to rename Idass to IdassId, then you would have two columns named IdassId, and that is what is causing your error.
You should be able to fill in the null values of IdassId and get your desired result.
df['IdassId'] = df['IdassId'].where(df['IdassId'].notnull(), df['Idass'])
df.drop('Idass', axis=1, inplace=True)
IdassId
0 0.0
1 1.0
2 2.0
3 3.0
4 4.0
5 0.0
6 1.0
7 2.0
8 3.0
9 4.0
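An equivalent, slightly shorter variant of the same idea (a sketch, using the same column names as above) fills the missing IdassId values directly from Idass:
# Fill the gaps in IdassId with the values from Idass, then drop the old column.
df['IdassId'] = df['IdassId'].fillna(df['Idass'])
df = df.drop(columns='Idass')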
A note on your d: rename expects {old_name: new_name}, so {'Idass': 'IdassId'} is already the right direction; the error comes from ending up with two IdassId columns after the rename, as explained above.

Pandas DataFrame: most data in columns are 'float', I want to delete the row which is 'str'

wu=pd.DataFrame({'a':['hhh',2,3,4,5],'b':[1,2,np.nan,np.nan,5]}
I want to delete the row with 'hhh', because all the other values in 'a' are numbers.
The original data size is huge. Thank you very much.
Option 1
Convert a using pd.to_numeric
df.a = pd.to_numeric(df.a, errors='coerce')
df
a b
0 NaN 1.0
1 2.0 2.0
2 3.0 NaN
3 4.0 NaN
4 5.0 5.0
Non-numeric values are coerced to NaN. You can then drop this row -
df.dropna(subset=['a'])
a b
1 2.0 2.0
2 3.0 NaN
3 4.0 NaN
4 5.0 5.0
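If you want to keep the original values in a rather than overwrite them with the coerced floats, a variant of the same idea (a sketch) builds a numeric mask from a coerced copy and filters with it:
import pandas as pd
import numpy as np
wu = pd.DataFrame({'a':['hhh',2,3,4,5],'b':[1,2,np.nan,np.nan,5]})
# Coerce a copy of 'a' to numeric; values that fail become NaN in the copy,
# so keeping the non-null rows drops the string row while 'a' stays untouched.
wu = wu[pd.to_numeric(wu['a'], errors='coerce').notnull()]
print(wu)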
Option 2
Another alternative is using str.isdigit -
df.a.str.isdigit()
0 False
1 NaN
2 NaN
3 NaN
4 NaN
Name: a, dtype: object
Filter as such -
df[df.a.str.isdigit().isnull()]
a b
1 2 2.0
2 3 NaN
3 4 NaN
4 5 5.0
Notes -
This won't work for float columns
If the numbers are also stored as strings, then drop the isnull bit -
df[df.a.str.isdigit()]
import pandas as pd
import numpy as np
wu=pd.DataFrame({'a':['hhh',2,3,4,5],'b':[1,2,np.nan,np.nan,5]})
#wu = wu[wu.a.str.contains('\d+',na=False)]
#wu = wu[wu.a.apply(lambda x: x.isnumeric())]
wu = wu[wu.a.apply(lambda x: isinstance(x, (int, np.int64)))]
print(wu)
Note that you left out a closing parenthesis when creating your DataFrame.
I tried three ways, but only the third one worked. You can always try the other ones (commented out) to see if they work for you. Do let me know if it works on the larger dataset.
df = pd.DataFrame({'a':['hhh',2,3,4,5],'b':[1,2,np.nan,np.nan,5]})
df.drop(df[df['a'].apply(type) != int].index, inplace=True)
If you just want to view the rows that would be dropped:
df.loc[df['a'].apply(type) != int, :]

Filter out CSV values after a space in python

So my goal is to read a CSV file created by a geocoder that has annoyingly put string values, a space, and latitude or longitude values together… I could go through all of these Excel cells and split them manually, but I would really like to read the CSV instead, use the space as the delimiter, and filter out all of the string values. I know how to import a CSV, and I think I even know how to specify space as the delimiter… But what I don't understand is how to filter out all of the string values and save only the numeric values in a brand-new Excel sheet. Does anyone know how to do this?
Here is the code I have so far to delimit the white space:
pd.read_csv('file.csv',delim_whitespace=True)
Use pd.read_csv to read your CSV, select_dtypes to select only numeric columns, and save only numeric columns to a CSV using to_csv.
df = pd.read_csv('file.csv', delim_whitespace=True)
df.select_dtypes(['float']).to_csv('file.csv')
If your file has no headers, you'll need to add header=None when reading the CSV.
df
a b c
0 1.0 0 foo
1 2.0 0 NaN
2 1.0 1 bar
3 1.0 1 foo
4 NaN 1 baz
5 3.0 1 foo
6 3.0 1 bar
df.select_dtypes(['float'])
a
0 1.0
1 2.0
2 1.0
3 1.0
4 NaN
5 3.0
6 3.0
If, for some reason, you have integer columns you want to save, change float to number:
df.select_dtypes(['number'])
a b
0 1.0 0
1 2.0 0
2 1.0 1
3 1.0 1
4 NaN 1
5 3.0 1
6 3.0 1
And just chain a .to_csv call.
If you get the data separated as you should, you can use this:
df.convert_objects(convert_numeric=True).dropna(axis=1)
and you can add .to_csv('your_file_name.csv') at the end.
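Note that convert_objects has been deprecated and removed in newer pandas versions. A rough modern equivalent of the same idea (a sketch, assuming the same whitespace-delimited file) coerces every column with pd.to_numeric and drops the columns that do not convert:
import pandas as pd
df = pd.read_csv('file.csv', delim_whitespace=True)
# Coerce each column to numeric; purely textual columns become all-NaN
# and are dropped, leaving only the numeric (latitude/longitude) columns.
numeric = df.apply(pd.to_numeric, errors='coerce').dropna(axis=1, how='all')
numeric.to_csv('numeric_only.csv', index=False)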
