unable to grab particular columns from a CSV file - python

I am trying to access the contents of a CSV file and parse it. I need only two columns out of the entire file. I can access the CSV and its contents, but I need to limit it to the columns I need so that I can use the values from those columns.
import os
import boto3
import pandas as pd
import sys
from io import StringIO # Python 3.x
session = boto3.session.Session(profile_name="rli-prod",region_name="us-east-1")
client = session.client("s3")
bucket_name = 'bucketname'
object_key = 'XX/YY/ZZ.csv'
csv_obj = client.get_object(Bucket=bucket_name, Key=object_key)
body = csv_obj['Body']
csv_string = body.read().decode('utf-8-sig')
df = pd.read_csv(StringIO(csv_string))
print(df)
Right now, I am getting the entire CSV. Below is the output:
0 63a2a854-a136-4bb1-a89b-a4e638b2be14 8128639b-a163-4e8e-b1f8-22e3dcd2b655 ... 123 63a2a854-a136-4bb1-a89b-a4e638b2be14
1 63a2a854-a136-4bb1-a89b-a4e638b2be14 8d6bdc73-f908-45d8-8d8a-c3ac0bee3b29 ... 123 63a2a854-a136-4bb1-a89b-a4e638b2be14
2 63a2a854-a136-4bb1-a89b-a4e638b2be14 1312e6f6-4c5f-4fa5-babd-93a3c0d3b502 ... 234 63a2a854-a136-4bb1-a89b-a4e638b2be14
3 63a2a854-a136-4bb1-a89b-a4e638b2be14 bfec5ccc-4449-401d-9898-9c523b1e1230 ... 456 63a2a854-a136-4bb1-a89b-a4e638b2be14
4 63a2a854-a136-4bb1-a89b-a4e638b2be14 522a72f0-2746-417c-9a59-fae4fb1e07d7 ... 567 63a2a854-a136-4bb1-a89b-a4e638b2be14
[5 rows x 9 columns]
Right now, my CSV does not have any headers, so the only option I have is to select by column number, but I am not sure how to do that. Can anyone please assist?

Option 1:
If you have already read the CSV and want to drop the other columns partway through your processing, pass the positions of the columns you want to keep to df.iloc.
Example:
>>> df #sample dataframe I want to get the first 2 columns only
Artist Count Test
0 The Beatles 4 1
1 Some Artist 2 1
2 Some Artist 2 1
3 The Beatles 4 1
4 The Beatles 4 1
5 The Beatles 4 1
>>> df3 = df.iloc[:,[0,1]]
>>> df3
Artist Count
0 The Beatles 4
1 Some Artist 2
2 Some Artist 2
3 The Beatles 4
4 The Beatles 4
5 The Beatles 4
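Because the CSV in the question has no header row, it may help to read it with header=None before slicing; otherwise pandas treats the first data row as column names. A minimal sketch, assuming made-up sample data in place of the decoded S3 body:

```python
from io import StringIO

import pandas as pd

# Made-up stand-in for the decoded S3 body; the real file has 9 columns.
csv_string = "a1,b1,c1,d1\na2,b2,c2,d2\na3,b3,c3,d3\n"

# header=None keeps the first line as data and labels the columns 0, 1, 2, ...
df = pd.read_csv(StringIO(csv_string), header=None)

# Keep the first and the last column by position.
subset = df.iloc[:, [0, 3]]
print(subset)
```

The positions [0, 3] are only an illustration; substitute the two positions you actually need.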
Option 2:
While reading the file, specify which columns to keep with the usecols parameter of read_csv().
df = pd.read_csv(StringIO(csv_string), usecols = [place column index here])
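For a file without headers, a filled-in version of the line above might look like the following sketch. The sample data and the column names first_id and value are assumptions made for illustration only:

```python
from io import StringIO

import pandas as pd

# Made-up sample data with four columns and no header row.
csv_string = "id1,x,10,id2\nid3,y,20,id4\n"

# With header=None, usecols takes integer positions; only columns 0 and 2 are parsed.
df = pd.read_csv(StringIO(csv_string), header=None, usecols=[0, 2])

# The kept columns are labelled 0 and 2; give them readable (hypothetical) names.
df.columns = ["first_id", "value"]
print(df)
```

Selecting at read time avoids loading the unwanted columns at all, which can matter for large files.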

Use the read_csv method from the pandas library:
import pandas as pd
data = pd.read_csv('file.csv', usecols=[2, 4])
print(data.head())
The usecols parameter accepts a list of column names or a list of column positions.

Since you are already using the pandas library, you can do this by passing the usecols= argument to the read_csv method, together with header=None because the file has no header row:
# will pull columns indexed [0,2,4]
df = pd.read_csv(StringIO(csv_string), header=None, usecols=[0,2,4])
Note that the header= argument does not select columns; it selects rows. From the docs: ... The header can be a list of integers that specify row locations for a multi-index on the columns e.g. [0,1,3]. Intervening rows that are not specified will be skipped (e.g. 2 in this example is skipped)...

In [15]: import pandas as pd
In [16]: d1 = {"col1" : "value11", "col2": "value21", "col3": "value31"}
In [17]: d2 = {"col1" : "value12", "col2": "value22", "col3": "value32"}
In [18]: d3 = {"col1" : "value13", "col2": "value23", "col3": "value33"}
In [19]: rows = [d1, d2, d3]
In [20]: # DataFrame.append was removed in pandas 2.0, so build the frame directly
In [21]: df = pd.DataFrame(rows * 2)
In [22]: df
Out[22]:
col1 col2 col3
0 value11 value21 value31
1 value12 value22 value32
2 value13 value23 value33
3 value11 value21 value31
4 value12 value22 value32
5 value13 value23 value33
In [23]: # Selecting only col1 and col3
In [24]: df_new = df[["col1", "col3"]]
In [25]: df_new
Out[25]:
col1 col3
0 value11 value31
1 value12 value32
2 value13 value33
3 value11 value31
4 value12 value32
5 value13 value33

Related

Creating Multi Indexed Dataframe with Its Name - Pandas/Python

Assuming I have this dataframe saved in my directory:
import numpy as np
import pandas as pd
col1 = pd.Series(np.linspace(1, 10, 20))
col2 = pd.Series(np.linspace(11, 20, 20))
data = np.array([col1, col2]).T
df = pd.DataFrame(data, columns = ["col1", "col2"])
df.to_csv("test.csv", index = False)
What I would like to do is read this file and add the name of the file as a level on top of the other columns, so that the file name sits above col1 and col2 in a two-level column header.
How can I do this?
Use pathlib to extract the file name using .stem and pd.concat to create a multi level column:
import pathlib
filename = pathlib.Path('path/to/test.csv')
df = pd.concat({filename.stem.capitalize(): pd.read_csv(filename)}, axis=1)
print(df)
# Output:
Test
col1 col2
0 1.000000 11.000000
1 1.473684 11.473684
2 1.947368 11.947368
3 2.421053 12.421053
4 2.894737 12.894737
5 3.368421 13.368421
6 3.842105 13.842105
7 4.315789 14.315789
8 4.789474 14.789474
9 5.263158 15.263158
10 5.736842 15.736842
11 6.210526 16.210526
12 6.684211 16.684211
13 7.157895 17.157895
14 7.631579 17.631579
15 8.105263 18.105263
16 8.578947 18.578947
17 9.052632 19.052632
18 9.526316 19.526316
19 10.000000 20.000000
Use MultiIndex.from_product:
file = 'test.csv'
df = pd.read_csv(file)
name = file.split('.')[0].capitalize()
df.columns= pd.MultiIndex.from_product([[name],df.columns])
print (df.head())
Test
col1 col2
0 1.000000 11.000000
1 1.473684 11.473684
2 1.947368 11.947368
3 2.421053 12.421053
4 2.894737 12.894737
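The same idea extends to several files at once. The sketch below is an illustration under assumptions: the file names test.csv and other.csv and the temporary directory are made up so that the example is self-contained.

```python
import pathlib
import tempfile

import pandas as pd

# Write two small CSV files to a temporary directory (hypothetical file names).
tmp_dir = pathlib.Path(tempfile.mkdtemp())
for stem in ("test", "other"):
    pd.DataFrame({"col1": [1.0, 2.0], "col2": [11.0, 12.0]}).to_csv(
        tmp_dir / f"{stem}.csv", index=False
    )

# The dict keys become the top level of the column MultiIndex.
paths = sorted(tmp_dir.glob("*.csv"))
df = pd.concat({p.stem.capitalize(): pd.read_csv(p) for p in paths}, axis=1)
print(df.columns.tolist())
```

A single column block can then be pulled out with df["Test"].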

How to read data from url to pandas dataframe

I'm trying to read data from https://download.bls.gov/pub/time.series/ee/ee.industry using pandas, like this:
import pandas as pd
url = 'https://download.bls.gov/pub/time.series/ee/ee.industry'
df = pd.read_csv(url, sep='\t')
Also tried getting the separator:
import pandas as pd
url = 'https://download.bls.gov/pub/time.series/ee/ee.industry'
reader = pd.read_csv(url, sep = None, iterator = True)
inferred_sep = reader._engine.data.dialect.delimiter
df = pd.read_csv(url, sep=inferred_sep)
However, the data is not very well formatted. The columns of the dataframe are right:
>>> df.columns
Index(['industry_code', 'SIC_code', 'publishing_status', 'industry_name'], dtype='object')
But the data does not correspond to the columns: all the data seems to be merged into the first two columns, and the last two do not have any data.
Any suggestions or ideas on a better approach to get this data?
EDIT
The expected result should be something like:
industry_code  SIC_code  publishing_status  industry_name
000000         N/A       B                  Total nonfarm 1 T 1
The reader works well but you don’t have the right number of columns in your header. You can get the other columns back using .reset_index() and then rename the columns:
>>> df = pd.read_csv(url, sep='\t')
>>> n_missing_headers = df.index.nlevels
>>> cols = df.columns.to_list() + [f'col{n}' for n in range(n_missing_headers)]
>>> df.reset_index(inplace=True)
>>> df.columns = cols
>>> df.head()
industry_code SIC_code publishing_status industry_name col0 col1 col2
0 0 NaN B Total nonfarm 1 T 1
1 5000 NaN A Total private 1 T 2
2 5100 NaN A Goods-producing 1 T 3
3 100000 10-14 A Mining 2 T 4
4 101000 10 A Metal mining 3 T 5
You can then keep the first 4 columns if you want:
>>> df.iloc[:, :-n_missing_headers].head()
industry_code SIC_code publishing_status industry_name
0 0 NaN B Total nonfarm
1 5000 NaN A Total private
2 5100 NaN A Goods-producing
3 100000 10-14 A Mining
4 101000 10 A Metal mining
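An alternative sketch is to supply names for every field up front, so that no reset_index step is needed. The sample text below is made up and only modelled on the output above, and col0 to col2 are the same placeholder names used in this answer; the real file may differ.

```python
from io import StringIO

import pandas as pd

# Made-up sample: the header names 4 fields, but each data row carries 7.
raw = (
    "industry_code\tSIC_code\tpublishing_status\tindustry_name\n"
    "000000\tN/A\tB\tTotal nonfarm\t1\tT\t1\n"
    "005000\tN/A\tA\tTotal private\t1\tT\t2\n"
)

lines = raw.splitlines()
header = lines[0].split("\t")
n_fields = len(lines[1].split("\t"))

# Pad the header with placeholder names for the unnamed trailing fields.
names = header + [f"col{n}" for n in range(n_fields - len(header))]

df = pd.read_csv(
    StringIO(raw),
    sep="\t",
    header=None,
    skiprows=1,                     # skip the short header line
    names=names,
    dtype={"industry_code": str},   # keep leading zeros such as 000000
    keep_default_na=False,          # keep the literal text N/A
)
print(df[header])
```

Reading industry_code as a string and disabling the default NA handling keeps 000000 and N/A as they appear in the expected result from the question.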

Can you simultaneously select and assign a column in a pandas DataFrame?

Using data.table in R, you can simultaneously select and assign columns. Assume one has a data.table with 3 columns--col1, col2, and col3. One could do the following using data.table:
dt2 <- dt[, .(col1, col2, newcol = 3, anothercol = col3)]
I want to do something similar in pandas but it looks like it would take 3 lines.
df2 = df.copy()
df2['newcol'] = 3
df2 = df2.rename(columns = {"col3" : "anothercol"})
Is there a more concise way to do what I did above?
This might work:
import pandas as pd
ddict = {
'col1':['A','A','B','X'],
'col2':['A','A','B','X'],
'col3':['A','A','B','X'],
}
df = pd.DataFrame(ddict)
df.loc[:, ['col1', 'col2', 'col3']].rename(columns={"col3":"anothercol"}).assign(newcol=3)
result:
col1 col2 anothercol newcol
0 A A A 3
1 A A A 3
2 B B B 3
3 X X X 3
I don't know R, but from what I can see you are adding a new column called newcol that has a value of 3 in every row.
You are also renaming a column from col3 to anothercol.
You don't really need the copy step.
df2 = df.rename(columns = {'col3': 'anothercol'})
df2['newcol'] = 3
You can use df.assign for that:
Example:
>>> df = pd.DataFrame({'temp_c': [17.0, 25.0]},
...                   index=['Portland', 'Berkeley'])
>>> df
temp_c
Portland 17.0
Berkeley 25.0
>>> df.assign(temp_f=lambda x: x.temp_c * 9 / 5 + 32)
temp_c temp_f
Portland 17.0 62.6
Berkeley 25.0 77.0
>>> df.assign(newcol=3).rename(columns={"temp_c":"anothercol"})
anothercol newcol
Portland 17.0 3
Berkeley 25.0 3
You can then assign the result to df2.
The first examples are taken from the pandas docs.
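If the column order of the data.table call in the question matters (col1, col2, newcol, anothercol), one possible single expression is sketched below. The sample values are made up; assign and list-based column selection are standard pandas.

```python
import pandas as pd

# Made-up sample frame with the three columns from the question.
df = pd.DataFrame({"col1": ["A", "B"], "col2": ["C", "D"], "col3": ["E", "F"]})

# Add both new columns, then select in the order of the data.table call.
df2 = df.assign(newcol=3, anothercol=df["col3"])[
    ["col1", "col2", "newcol", "anothercol"]
]
print(df2)
```

assign returns a new DataFrame, so df itself keeps its original col3 and no explicit copy is needed.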

Save pandas pivot_table to include index and columns names

I want to save a pandas pivot table for human reading, but DataFrame.to_csv doesn't include the DataFrame.columns.name. How can I do that?
Example:
For the following pivot table:
>>> import pandas as pd
>>> df = pd.DataFrame([[1, 2, 3], [6, 7, 8]])
>>> df.columns = list("ABC")
>>> df.index = list("XY")
>>> df
A B C
X 1 2 3
Y 6 7 8
>>> p = pd.pivot_table(data=df, index="A", columns="B", values="C")
When viewing the pivot table, we have both the index name ("A"), and the columns name ("B").
>>> p
B 2 7
A
1 3.0 NaN
6 NaN 8.0
But when exporting as a csv we lose the columns name:
>>> p.to_csv("temp.csv")
===temp.csv===
A,2,7
1,3.0,
6,,8.0
How can I get some kind of human-readable output format which contains the whole of the pivot table, including the .columns.name ("B")?
Something like this would be fine:
B,2,7
A,,
1,3.0,
6,,8.0
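Yes, it is possible by prepending a helper DataFrame, but reading the file back is a bit more complicated:
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
p1 = pd.concat([pd.DataFrame(columns=p.columns, index=[p.index.name]), p])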
p1.to_csv('temp.csv',index_label=p.columns.name)
B,2,7
A,,
1,3.0,
6,,8.0
#set first column to index
df = pd.read_csv('temp.csv', index_col=0)
#set columns and index names
df.columns.name = df.index.name
df.index.name = df.index[0]
#remove first row of data
df = df.iloc[1:]
print (df)
B 2 7
A
1 3.0 NaN
6 NaN 8.0
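The write and read steps can be wrapped in a pair of helpers. This is only a sketch: write_pivot and read_pivot are hypothetical names, and the extra row is added with reindex here instead of a helper DataFrame.

```python
import os
import tempfile

import pandas as pd


def write_pivot(p, path):
    """Write pivot table p with an extra row that holds the index name."""
    # Prepend a row labelled with the index name; its cells are NaN (empty in the CSV).
    p1 = p.reindex(index=[p.index.name, *p.index])
    p1.to_csv(path, index_label=p.columns.name)


def read_pivot(path):
    """Read back a file produced by write_pivot."""
    df = pd.read_csv(path, index_col=0)
    df.columns.name = df.index.name  # header cell, e.g. "B"
    df.index.name = df.index[0]      # first row label, e.g. "A"
    return df.iloc[1:]               # drop the helper row


df = pd.DataFrame([[1, 2, 3], [6, 7, 8]], columns=list("ABC"), index=list("XY"))
p = pd.pivot_table(data=df, index="A", columns="B", values="C")

path = os.path.join(tempfile.mkdtemp(), "temp.csv")
write_pivot(p, path)
restored = read_pivot(path)
print(restored)
```

Note that after the round trip the row and column labels are strings, because the helper row makes the first CSV column non-numeric.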

How to assign a value_count output to a dataframe

I am trying to assign the output of value_counts to a new df. My code follows.
import pandas as pd
import glob
df = pd.concat((pd.read_csv(f, names=['date','bill_id','sponsor_id']) for f in glob.glob('/home/jayaramdas/anaconda3/df/s11?_s_b')))
column_list = ['date', 'bill_id']
df = df.set_index(column_list, drop = True)
df = df['sponsor_id'].value_counts()
df.columns=['sponsor', 'num_bills']
print (df)
The value counts are not being given the column headers I specified, 'sponsor' and 'num_bills'. I'm getting the following output from print:
1036 426
791 408
1332 401
1828 388
136 335
Name: sponsor_id, dtype: int64
The column lengths don't match: you read 3 columns from the CSV and then set the index to 2 of them. value_counts then produces a Series with the column values as the index and the counts as the values, and a Series has no columns to rename. You need to call reset_index and then overwrite the column names:
df = df.reset_index()
df.columns=['sponsor', 'num_bills']
Example:
In [276]:
df = pd.DataFrame({'col_name':['a','a','a','b','b']})
df
Out[276]:
col_name
0 a
1 a
2 a
3 b
4 b
In [277]:
df['col_name'].value_counts()
Out[277]:
a 3
b 2
Name: col_name, dtype: int64
In [278]:
type(df['col_name'].value_counts())
Out[278]:
pandas.core.series.Series
In [279]:
df = df['col_name'].value_counts().reset_index()
df.columns = ['col_name', 'count']
df
Out[279]:
col_name count
0 a 3
1 b 2
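The two column names can also be set while converting the Series, without assigning to df.columns afterwards. A small sketch with made-up sponsor ids, using rename_axis and reset_index(name=...):

```python
import pandas as pd

# Made-up sample: sponsor 1036 appears twice, sponsor 791 once.
df = pd.DataFrame({"sponsor_id": [1036, 1036, 791]})

counts = (
    df["sponsor_id"]
    .value_counts()
    .rename_axis("sponsor")          # names the index holding the distinct values
    .reset_index(name="num_bills")   # names the column holding the counts
)
print(counts)
```

This form does not depend on what the intermediate Series or its index happen to be called.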
Appending value_counts() to multi-column dataframe:
df = pd.DataFrame({'C1':['A','B','A'],'C2':['A','B','A']})
vc_df = df.value_counts().to_frame('Count').reset_index()
display(df, vc_df)
C1 C2
0 A A
1 B B
2 A A
C1 C2 Count
0 A A 2
1 B B 1
