I'm trying to read data from https://download.bls.gov/pub/time.series/ee/ee.industry using pandas, like this:
import pandas as pd
url = 'https://download.bls.gov/pub/time.series/ee/ee.industry'
df = pd.read_csv(url, sep='\t')
Also tried getting the separator:
import pandas as pd
url = 'https://download.bls.gov/pub/time.series/ee/ee.industry'
reader = pd.read_csv(url, sep=None, iterator=True)
inferred_sep = reader._engine.data.dialect.delimiter
df = pd.read_csv(url, sep=inferred_sep)
However, the data is not very well formatted. The columns of the dataframe are right:
>>> df.columns
Index(['industry_code', 'SIC_code', 'publishing_status', 'industry_name'], dtype='object')
But the data does not correspond to the columns; it seems all the data is merged into the first two columns and the last two do not have any data.
Any suggestion/idea on a better approach to get this data?
EDIT
The expected result should be something like:
industry_code  SIC_code  publishing_status  industry_name
000000         N/A       B                  Total nonfarm 1 T 1
The reader works well but you don’t have the right number of columns in your header. You can get the other columns back using .reset_index() and then rename the columns:
>>> df = pd.read_csv(url, sep='\t')
>>> # The rows have more fields than the header, so pandas parsed the
>>> # leftmost extra fields into a (Multi)Index; count its levels
>>> n_missing_headers = df.index.nlevels
>>> cols = df.columns.to_list() + [f'col{n}' for n in range(n_missing_headers)]
>>> # Move the misparsed index levels back into regular columns
>>> df.reset_index(inplace=True)
>>> df.columns = cols
>>> df.head()
industry_code SIC_code publishing_status industry_name col0 col1 col2
0 0 NaN B Total nonfarm 1 T 1
1 5000 NaN A Total private 1 T 2
2 5100 NaN A Goods-producing 1 T 3
3 100000 10-14 A Mining 2 T 4
4 101000 10 A Metal mining 3 T 5
You can then keep the first 4 columns if you want:
>>> df.iloc[:, :-n_missing_headers].head()
industry_code SIC_code publishing_status industry_name
0 0 NaN B Total nonfarm
1 5000 NaN A Total private
2 5100 NaN A Goods-producing
3 100000 10-14 A Mining
4 101000 10 A Metal mining
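If you'd rather avoid the reset_index step entirely, you can also skip the short header row and name all seven fields up front. A minimal sketch, assuming the URL is reachable and each data row has exactly seven tab-separated fields; the colN names are placeholders for the unlabeled fields:
import pandas as pd

url = 'https://download.bls.gov/pub/time.series/ee/ee.industry'
names = ['industry_code', 'SIC_code', 'publishing_status', 'industry_name',
         'col0', 'col1', 'col2']  # placeholder names for the unlabeled fields
# Skip the malformed header row and supply the full set of column names
df = pd.read_csv(url, sep='\t', skiprows=1, header=None, names=names)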
I have input dataframe df1 like this:
url_asia_us_emea
https://asia/image
https://asia/location
https://asia/video
I want to replicate the rows in df1, changing the region based on the column name. Let's say for this example I need the below output, as all three regions are in the column name:
url_asia_us_emea
https://asia/image
https://asia/location
https://asia/video
https://us/image
https://us/location
https://us/video
https://emea/image
https://emea/location
https://emea/video
You could do something like this:
import pandas as pd

list_strings = ['us', 'ema', 'fr']
df_orig = pd.read_csv("ack.csv", sep=";")
which is
url
0 https://asia/image
1 https://asia/location
2 https://asia/video
and then
d = []
for element in list_strings:
    df = pd.read_csv("ack.csv", sep=";")
    # Substitute the region into every URL
    df['url'] = df['url'].replace({'asia': str(element)}, regex=True)
    d.append(df)
df = pd.concat(d)
DF = pd.concat([df, df_orig], ignore_index=True)
which results in
url
0 https://us/image
1 https://us/location
2 https://us/video
3 https://ema/image
4 https://ema/location
5 https://ema/video
6 https://fr/image
7 https://fr/location
8 https://fr/video
9 https://asia/image
10 https://asia/location
11 https://asia/video
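A loop-free variant is also possible. This is a sketch, assuming the regions can be read off the column name itself (as in the question's url_asia_us_emea), rather than from a separate list:
import pandas as pd

col = 'url_asia_us_emea'
df1 = pd.DataFrame({col: ['https://asia/image',
                          'https://asia/location',
                          'https://asia/video']})
regions = col.split('_')[1:]  # ['asia', 'us', 'emea']
# Build one copy of the frame per region, substituting the region in the URL
out = pd.concat(
    [df1[col].str.replace('asia', r, regex=False).to_frame(col) for r in regions],
    ignore_index=True)
print(out)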
I am trying to access the contents of a CSV file and parse it. I just need two columns out of the entire CSV file. I can access the CSV and its contents, but I need to limit it to the columns I need so that I can use the details from those columns.
import os
import boto3
import pandas as pd
import sys
from io import StringIO # Python 3.x
session = boto3.session.Session(profile_name="rli-prod",region_name="us-east-1")
client = session.client("s3")
bucket_name = 'bucketname'
object_key = 'XX/YY/ZZ.csv'
csv_obj = client.get_object(Bucket=bucket_name, Key=object_key)
body = csv_obj['Body']
csv_string = body.read().decode('utf-8-sig')
df = pd.read_csv(StringIO(csv_string))
print(df)
Right now, I am getting the entire CSV. Below is the output
0 63a2a854-a136-4bb1-a89b-a4e638b2be14 8128639b-a163-4e8e-b1f8-22e3dcd2b655 ... 123 63a2a854-a136-4bb1-a89b-a4e638b2be14
1 63a2a854-a136-4bb1-a89b-a4e638b2be14 8d6bdc73-f908-45d8-8d8a-c3ac0bee3b29 ... 123 63a2a854-a136-4bb1-a89b-a4e638b2be14
2 63a2a854-a136-4bb1-a89b-a4e638b2be14 1312e6f6-4c5f-4fa5-babd-93a3c0d3b502 ... 234 63a2a854-a136-4bb1-a89b-a4e638b2be14
3 63a2a854-a136-4bb1-a89b-a4e638b2be14 bfec5ccc-4449-401d-9898-9c523b1e1230 ... 456 63a2a854-a136-4bb1-a89b-a4e638b2be14
4 63a2a854-a136-4bb1-a89b-a4e638b2be14 522a72f0-2746-417c-9a59-fae4fb1e07d7 ... 567 63a2a854-a136-4bb1-a89b-a4e638b2be14
[5 rows x 9 columns]
Right now, my CSV does not have any headers, so the only option I have is to grab columns by number, but I am not sure how to do that. Can anyone please assist?
Option 1:
If you have already read the CSV and want to drop the other columns mid-calculation, use the indices of the columns you want to keep inside df.iloc.
Example:
>>> df #sample dataframe I want to get the first 2 columns only
Artist Count Test
0 The Beatles 4 1
1 Some Artist 2 1
2 Some Artist 2 1
3 The Beatles 4 1
4 The Beatles 4 1
5 The Beatles 4 1
>>> df3 = df.iloc[:,[0,1]]
>>> df3
Artist Count
0 The Beatles 4
1 Some Artist 2
2 Some Artist 2
3 The Beatles 4
4 The Beatles 4
5 The Beatles 4
Option 2:
When reading the file itself, specify which columns to keep via the usecols parameter of read_csv():
df = pd.read_csv(StringIO(csv_string), usecols=[place column index here])
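Since the file has no header row, it is also worth passing header=None so the first data row is not consumed as column labels. A sketch reusing csv_string from the question; the indices [0, 5] are placeholders for whichever two columns you need:
from io import StringIO
import pandas as pd

# csv_string comes from the S3 read in the question
df = pd.read_csv(StringIO(csv_string), header=None, usecols=[0, 5])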
Use the read_csv method from the pandas library:
import pandas as pd
data = pd.read_csv('file.csv', usecols=[2, 4])
print(data.head())
The usecols parameter accepts column names or column indices as a list.
Since you are already utilizing the Pandas library, you should be able to accomplish this by passing the usecols= argument to the read_csv method like so:
# will pull columns indexed [0, 2, 4]
df = pd.read_csv(StringIO(csv_string), usecols=[0, 2, 4])
Note that the header= argument would not help here; it selects header rows, not columns. From the docs: ... The header can be a list of integers that specify row locations for a multi-index on the columns e.g. [0,1,3]. Intervening rows that are not specified will be skipped (e.g. 2 in this example is skipped)...
In [15]: import pandas as pd
In [16]: d1 = {"col1": "value11", "col2": "value21", "col3": "value31"}
In [17]: d2 = {"col1": "value12", "col2": "value22", "col3": "value32"}
In [18]: d3 = {"col1": "value13", "col2": "value23", "col3": "value33"}
In [19]: df = pd.DataFrame()
In [20]: df = df.append(d1, ignore_index=True, verify_integrity=True, sort=False)
In [21]: df = df.append(d2, ignore_index=True, verify_integrity=True, sort=False)
In [22]: df = df.append(d3, ignore_index=True, verify_integrity=True, sort=False)
In [23]: df
Out[23]:
      col1     col2     col3
0  value11  value21  value31
1  value12  value22  value32
2  value13  value23  value33
In [24]: # Selecting only col1 and col3
In [25]: df_new = df[["col1", "col3"]]
In [26]: df_new
Out[26]:
      col1     col3
0  value11  value31
1  value12  value32
2  value13  value33
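Note that DataFrame.append was removed in pandas 2.0. On current versions, the same result can be had by building the frame from the records in one step; a minimal sketch:
import pandas as pd

d1 = {"col1": "value11", "col2": "value21", "col3": "value31"}
d2 = {"col1": "value12", "col2": "value22", "col3": "value32"}
d3 = {"col1": "value13", "col2": "value23", "col3": "value33"}

# Build the frame from the records directly instead of appending row by row
df = pd.DataFrame([d1, d2, d3])

# Selecting only col1 and col3
df_new = df[["col1", "col3"]]
print(df_new)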
I want to merge 3 columns into a single column. I have tried changing the column types. However, I could not do it.
For example, I have 3 columns such as A: {1,2,4}, B:{3,4,4}, C:{1,1,1}
Output expected: ABC Column {131, 241, 441}
My inputs are like this:
df['ABC'] = df['A'].map(str) + df['B'].map(str) + df['C'].map(str)
df.head()
ABC {13.01.0 , 24.01.0, 44.01.0}
The type of ABC seems to be object, and I could not change it via str or int.
df['ABC'].apply(str)
Also, I realized that there are NaN values in columns A, B, and C. Is it possible to merge these even with NaN values?
# Example
import pandas as pd
import numpy as np
df = pd.DataFrame()
# Considering NaN's in the data-frame
df['colA'] = [1,2,4, np.NaN,5]
df['colB'] = [3,4,4,3,np.NaN]
df['colC'] = [1,1,1,4,1]
# Using pd.isna() to check for NaN values in the columns
df['colA'] = df['colA'].apply(lambda x: x if pd.isna(x) else str(int(x)))
df['colB'] = df['colB'].apply(lambda x: x if pd.isna(x) else str(int(x)))
df['colC'] = df['colC'].apply(lambda x: x if pd.isna(x) else str(int(x)))
# Filling the NaN values with a blank space
df = df.fillna('')
# Transform columns into string
df = df.astype(str)
# Concatenating all together
df['ABC'] = df.sum(axis=1)
A workaround for your NaN problem could look like this, but note that NaN will become 0:
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, 2, 4, np.nan], 'B': [3, 4, 4, 4], 'C': [1, np.nan, 1, 3]})
df = df.replace(np.nan, 0, regex=True).astype(int).applymap(str)
df['ABC'] = df['A'] + df['B'] + df['C']
output
A B C ABC
0 1 3 1 131
1 2 4 0 240
2 4 4 1 441
3 0 4 3 043
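If you would rather keep NaNs out of the result instead of turning them into 0, pandas' nullable integer dtype offers a compact route. A sketch, assuming empty strings are acceptable where a value is missing:
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, 2, 4, np.nan], 'B': [3, 4, 4, 4], 'C': [1, np.nan, 1, 3]})
# Int64 keeps missing values as <NA> while dropping the float '.0' suffix
df['ABC'] = (df[['A', 'B', 'C']]
             .astype('Int64')
             .astype(str)
             .replace('<NA>', '')
             .agg(''.join, axis=1))
print(df)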
I have three dataframes of different lengths. I am combining them into one dataframe to save it. Now, I want to retrieve the individual dataframes from the combined dataframe using indices. A sample of my problem is given below:
df1 =
data
0 10
1 20
df2 =
data
0 100
1 200
2 300
df3 =
data
0 1000
1 2000
2 3000
3 4000
combdf = pd.concat([df1, df2, df3], ignore_index=True)
combdf =
data
0 10
1 20
2 100
3 200
4 300
5 1000
6 2000
7 3000
8 4000
I want to retrieve the data of the individual dataframes from combdf. My code:
data_len = [len(df1), len(df2), len(df3)]
for k in range(len(data_len)):
    if k == 0:
        st_id = 0
    else:
        st_id = sum(data_len[:k])
    ed_id = st_id + data_len[k]
    print(combdf.iloc[st_id:ed_id])
The above code works fine. Is there a better approach that does not use a for loop?
Instead of calculating the indices while looping, you can generate them first and then use those to loop.
import numpy as np

data_len = [0, len(df1), len(df2), len(df3)]
data_index = np.cumsum(data_len)  # contains [0, 2, 5, 9]
for i in range(len(data_index) - 1):
    print(combdf.iloc[data_index[i]:data_index[i + 1]])
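If the goal is to avoid the explicit loop body, the same slicing also fits in a single list comprehension. A sketch under the same setup, with the split points derived from the cumulative lengths:
import numpy as np

bounds = np.cumsum([0, len(df1), len(df2), len(df3)])  # [0, 2, 5, 9]
parts = [combdf.iloc[a:b] for a, b in zip(bounds[:-1], bounds[1:])]
df1_back, df2_back, df3_back = parts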
You could create a second index column with pd.MultiIndex that has the name of the original DataFrame. Below you can see a sample of how you could do this:
import pandas as pd
df_dict = {}
df_dict['df1'] = pd.DataFrame([10, 20])
df_dict['df2'] = pd.DataFrame([100, 200, 300])
df_dict['df3'] = pd.DataFrame([1000, 2000, 3000, 4000])
for df_name, df in df_dict.items():
    # Generate second level of index
    df_index_to_array = df.index.tolist()
    df_index_second_level = [df_name for i in range(0, df.shape[0])]
    df_idx_multi_index = pd.MultiIndex.from_arrays([
        df_index_to_array,
        df_index_second_level
    ])
    df_dict[df_name] = df.set_index(df_idx_multi_index)
df_list = [df for _, df in df_dict.items()]
comb_df = pd.concat(df_list)
This would result in:
0
0 df1 10
1 df1 20
0 df2 100
1 df2 200
2 df2 300
0 df3 1000
1 df3 2000
2 df3 3000
3 df3 4000
In order to access each item, you can use .loc from pandas, for example:
>>> comb_df.loc[0, 'df2']
0 100
Name: (0, df2), dtype: int64
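For this particular task, pd.concat can build the second index level for you via its keys= parameter. A sketch of the same idea with less code:
import pandas as pd

df1 = pd.DataFrame({'data': [10, 20]})
df2 = pd.DataFrame({'data': [100, 200, 300]})
df3 = pd.DataFrame({'data': [1000, 2000, 3000, 4000]})

# keys= labels each input frame, producing a (name, row) MultiIndex
comb_df = pd.concat([df1, df2, df3], keys=['df1', 'df2', 'df3'])

# Recover an individual frame by its label
print(comb_df.loc['df2'])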
I want to save a pandas pivot table for human reading, but DataFrame.to_csv doesn't include the DataFrame.columns.name. How can I do that?
Example:
For the following pivot table:
>>> import pandas as pd
>>> df = pd.DataFrame([[1, 2, 3], [6, 7, 8]])
>>> df.columns = list("ABC")
>>> df.index = list("XY")
>>> df
A B C
X 1 2 3
Y 6 7 8
>>> p = pd.pivot_table(data=df, index="A", columns="B", values="C")
When viewing the pivot table, we have both the index name ("A"), and the columns name ("B").
>>> p
B 2 7
A
1 3.0 NaN
6 NaN 8.0
But when exporting as a csv we lose the columns name:
>>> p.to_csv("temp.csv")
===temp.csv===
A,2,7
1,3.0,
6,,8.0
How can I get some kind of human-readable output format which contains the whole of the pivot table, including the .columns.name ("B")?
Something like this would be fine:
B,2,7
A,,
1,3.0,
6,,8.0
Yes, it is possible by appending a helper DataFrame, but reading the file back is a bit more complicated:
# Prepend an empty row labeled with the index name, then write the columns name as the index label
p1 = pd.DataFrame(columns=p.columns, index=[p.index.name]).append(p)
p1.to_csv('temp.csv', index_label=p.columns.name)
B,2,7
A,,
1,3.0,
6,,8.0
#set first column to index
df = pd.read_csv('temp.csv', index_col=0)
#set columns and index names
df.columns.name = df.index.name
df.index.name = df.index[0]
#remove first row of data
df = df.iloc[1:]
print (df)
B 2 7
A
1 3.0 NaN
6 NaN 8.0
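One caveat: DataFrame.append was removed in pandas 2.0, so on current versions the helper row has to be attached with pd.concat instead. A sketch, rebuilding p from the question's example:
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [6, 7, 8]], columns=list("ABC"), index=list("XY"))
p = pd.pivot_table(data=df, index="A", columns="B", values="C")

# Same helper-row trick, without the removed DataFrame.append
helper = pd.DataFrame(columns=p.columns, index=[p.index.name])
p1 = pd.concat([helper, p])
p1.to_csv('temp.csv', index_label=p.columns.name)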