Creating a Multi-Indexed DataFrame with Its Name - Pandas/Python

Assuming I have this dataframe saved in my directory:
import numpy as np
import pandas as pd
col1 = pd.Series(np.linspace(1, 10, 20))
col2 = pd.Series(np.linspace(11, 20, 20))
data = np.array([col1, col2]).T
df = pd.DataFrame(data, columns = ["col1", "col2"])
df.to_csv("test.csv", index = False)
What I would like to do is read this file and add the name of the file as a column level on top of the other columns, to get something like this:
How can I do this?

Use pathlib to extract the file name with .stem, and pd.concat to create a multi-level column:
import pathlib
filename = pathlib.Path('path/to/test.csv')
df = pd.concat({filename.stem.capitalize(): pd.read_csv(filename)}, axis=1)
print(df)
# Output:
Test
col1 col2
0 1.000000 11.000000
1 1.473684 11.473684
2 1.947368 11.947368
3 2.421053 12.421053
4 2.894737 12.894737
5 3.368421 13.368421
6 3.842105 13.842105
7 4.315789 14.315789
8 4.789474 14.789474
9 5.263158 15.263158
10 5.736842 15.736842
11 6.210526 16.210526
12 6.684211 16.684211
13 7.157895 17.157895
14 7.631579 17.631579
15 8.105263 18.105263
16 8.578947 18.578947
17 9.052632 19.052632
18 9.526316 19.526316
19 10.000000 20.000000
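If you later need several such files side by side, the same dict-based pd.concat pattern extends naturally. A minimal sketch, assuming the files sit in one (hypothetical) directory and share the same row index:
import pathlib
import pandas as pd

folder = pathlib.Path('path/to')  # hypothetical directory holding the CSV files
frames = {p.stem.capitalize(): pd.read_csv(p) for p in sorted(folder.glob('*.csv'))}
combined = pd.concat(frames, axis=1)  # one top-level column label per file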

Use MultiIndex.from_product:
file = 'test.csv'
df = pd.read_csv(file)
name = file.split('.')[0].capitalize()
df.columns = pd.MultiIndex.from_product([[name], df.columns])
print(df.head())
Test
col1 col2
0 1.000000 11.000000
1 1.473684 11.473684
2 1.947368 11.947368
3 2.421053 12.421053
4 2.894737 12.894737
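Note that file.split('.')[0] only works for a bare file name; if the file is given as a full path (or the name contains extra dots), pathlib.Path(...).stem is the safer choice. A small sketch with a hypothetical path:
import pathlib
import pandas as pd

file = 'some/dir/test.csv'  # hypothetical path
name = pathlib.Path(file).stem.capitalize()  # 'Test', regardless of the directory part
df = pd.read_csv(file)
df.columns = pd.MultiIndex.from_product([[name], df.columns])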

Related

Pandas find common NA records across multiple large dataframes

I have 3 dataframes, as shown below:
ID,col1,col2
1,X,35
2,X,37
3,nan,32
4,nan,34
5,X,21
df1 = pd.read_clipboard(sep=',',skipinitialspace=True)
ID,col1,col2
1,nan,305
2,X,307
3,X,302
4,nan,304
5,X,201
df2 = pd.read_clipboard(sep=',',skipinitialspace=True)
ID,col1,col2
1,X,315
2,nan,317
3,X,312
4,nan,314
5,X,21
df3 = pd.read_clipboard(sep=',',skipinitialspace=True)
Now I want to identify the IDs where col1 is NA in all 3 input dataframes.
So, I tried the below
L1=df1[df1['col1'].isna()]['ID'].tolist()
L2=df2[df2['col1'].isna()]['ID'].tolist()
L3=df3[df3['col1'].isna()]['ID'].tolist()
common_ids_all = list(set.intersection(*map(set, [L1,L2,L3])))
final_df = pd.concat([df1,df2,df3],ignore_index=True)
final_df[final_df['ID'].isin(common_ids_all)]
While the above works, is there any more efficient and elegant approach to do the above?
As you can see, I am repeating the same statement three times (once per dataframe).
However, in my real data I have 12 dataframes, where I have to get the IDs whose col1 is NA in all 12 dataframes.
Update: my current read operation looks like the below.
fnames = ['file1.xlsx','file2.xlsx', 'file3.xlsx']
dfs = []
NA_list = []

def preprocessing(fname):
    df = pd.read_excel(fname, sheet_name="Sheet1")
    df.columns = df.iloc[7]
    df = df.iloc[8:, :]
    NA_list.append(df[df['col1'].isna()]['ID'])
    dfs.append(df)
[preprocessing(fname) for fname in fnames]
final_df = pd.concat(dfs, ignore_index=True)
L1 = NA_list[0]
L2 = NA_list[1]
L3 = NA_list[2]
final_list = (list(set.intersection(*map(set, [L1,L2,L3]))))
final_df[final_df['ID'].isin(final_list)]
You can use:
dfs = [df1, df2, df3]
final_df = pd.concat(dfs).query('col1.isna()')
final_df = final_df[final_df.groupby('ID')['ID'].transform('size') == len(dfs)]
print(final_df)
# Output
ID col1 col2
3 4 NaN 34
3 4 NaN 304
3 4 NaN 314
Full code:
fnames = ['file1.xlsx', 'file2.xlsx', 'file3.xlsx']

def preprocessing(fname):
    return pd.read_excel(fname, sheet_name='Sheet1', skiprows=6)

dfs = [preprocessing(fname) for fname in fnames]
final_df = pd.concat([df[df['col1'].isna()] for df in dfs])
final_df = final_df[final_df.groupby('ID')['ID'].transform('size') == len(dfs)]
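If you prefer to keep your original set-intersection idea, it also generalizes to any number of dataframes without repeating the same statement. A sketch, assuming every frame has ID and col1 columns:
dfs = [df1, df2, df3]  # extend to all 12 frames as needed

# IDs whose col1 is NA, intersected across every frame
common_ids = set.intersection(*(set(d.loc[d['col1'].isna(), 'ID']) for d in dfs))

final_df = pd.concat(dfs, ignore_index=True)
final_df = final_df[final_df['ID'].isin(common_ids)]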
This is where defining a function gets you sorted. If the dataframe list will continually change, I would create a function. If I got you right, the following will do:
def CombinedNaNs(lst):
    newdflist = []
    for d in lst:
        newdflist.append(d[d['col1'].isna()])
    s = pd.concat(newdflist)
    return s[s.duplicated(subset=['ID'], keep=False)].drop_duplicates()

dflist = [df1, df2, df3]  # list of dfs
CombinedNaNs(dflist)  # apply the function
ID col1 col2
3 4 NaN 34
3 4 NaN 304
3 4 NaN 314

Python - extract text from a string after the initial extraction of the number

My little school project is getting a bit more complicated, hence I am looking for help. I have a data frame with some text in col1 that I use to extract the max numeric value. Now I have been trying to use that extracted number to extract it together with any characters before and after it (from the preceding space to the following space).
Here is my code:
from numpy import floor, int64
from numpy.core import numeric
import pandas as pd
data = [['aaa', 10], ['nick12 text 1 a 1000a', 15], ['juli078 aq 199 299-01 aaa', 14]]
df = pd.DataFrame(data, columns = ['col1', 'col2'])
print(df.dtypes)
pat = (r'(\d+(?:\.\d+)?)')
df['Number'] = df['col1'].str.extractall(pat).astype(int).max(level=0)
df['Number'] = df['Number'].fillna(0)
df['Number'] = df['Number'].astype(int)
print(df.dtypes)
print(df)
I want to add another column NumberText so my final result looks like this:
col1 col2 Number NumberText
0 aaa 10 0
1 nick12 text 1 a 1000a 15 1000 1000a
2 juli078 aq 199 299-01 aaa 14 299 299-01
You can try:
df['NumberText'] = df.apply(
    lambda x: ' '.join(
        [word if str(x['Number']) in word else '' for word in x['col1'].split(' ')]
    ).strip(),
    axis=1,
)
Output:
col1 col2 Number NumberText
0 aaa 10 0
1 nick12 text 1 a 1000a 15 1000 1000a
2 juli078 aq 199 299-01 aaa 14 299 299-01
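An alternative, if you'd rather not rebuild the string word by word, is to search for the whitespace-delimited token that contains the extracted number with a regex. A sketch (the helper name number_token is made up here):
import re

def number_token(row):
    # return the token around the extracted number, or '' when no number was found
    if row['Number'] == 0:
        return ''
    match = re.search(rf"\S*{row['Number']}\S*", row['col1'])
    return match.group(0) if match else ''

df['NumberText'] = df.apply(number_token, axis=1)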

How to read data from url to pandas dataframe

I'm trying to read data from https://download.bls.gov/pub/time.series/ee/ee.industry using pandas, like this:
import pandas as pd
url = 'https://download.bls.gov/pub/time.series/ee/ee.industry'
df = pd.read_csv(url, sep='\t')
Also tried getting the separator:
import pandas as pd
url = 'https://download.bls.gov/pub/time.series/ee/ee.industry'
reader = pd.read_csv(url, sep = None, iterator = True)
inferred_sep = reader._engine.data.dialect.delimiter
df = pd.read_csv(url, sep=inferred_sep)
However, the data is not very well formatted. The columns of the dataframe are right:
>>> df.columns
Index(['industry_code', 'SIC_code', 'publishing_status', 'industry_name'], dtype='object')
But the data does not correspond to the columns; it seems all the data is merged into the first two columns and the last two do not have any data.
Any suggestion/idea on a better approach to get this data?
EDIT
The expected result should be something like:
industry_code  SIC_code  publishing_status  industry_name
000000         N/A       B                  Total nonfarm 1 T 1
The reader works well but you don’t have the right number of columns in your header. You can get the other columns back using .reset_index() and then rename the columns:
>>> df = pd.read_csv(url, sep='\t')
>>> n_missing_headers = df.index.nlevels
>>> cols = df.columns.to_list() + [f'col{n}' for n in range(n_missing_headers)]
>>> df.reset_index(inplace=True)
>>> df.columns = cols
>>> df.head()
industry_code SIC_code publishing_status industry_name col0 col1 col2
0 0 NaN B Total nonfarm 1 T 1
1 5000 NaN A Total private 1 T 2
2 5100 NaN A Goods-producing 1 T 3
3 100000 10-14 A Mining 2 T 4
4 101000 10 A Metal mining 3 T 5
You can then keep the first 4 columns if you want:
>>> df.iloc[:, :-n_missing_headers].head()
industry_code SIC_code publishing_status industry_name
0 0 NaN B Total nonfarm
1 5000 NaN A Total private
2 5100 NaN A Goods-producing
3 100000 10-14 A Mining
4 101000 10 A Metal mining
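Another option, assuming the file really has seven tab-separated fields per row (as the output above suggests), is to skip the short header line and supply your own column names; the three trailing names below are hypothetical placeholders:
import pandas as pd

url = 'https://download.bls.gov/pub/time.series/ee/ee.industry'
names = ['industry_code', 'SIC_code', 'publishing_status', 'industry_name',
         'extra1', 'extra2', 'extra3']  # last three names are made up
df = pd.read_csv(url, sep='\t', skiprows=1, header=None, names=names)
df = df[names[:4]]  # keep only the four documented columns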

unable to grab particular columns from a CSV file

I am trying to access the contents of a CSV file and parse it. I just need two columns out of the entire CSV file. I can access the CSV and its contents, but I need to limit it to the columns I need so that I can use the details from those columns.
import os
import boto3
import pandas as pd
import sys
from io import StringIO # Python 3.x
session = boto3.session.Session(profile_name="rli-prod",region_name="us-east-1")
client = session.client("s3")
bucket_name = 'bucketname'
object_key = 'XX/YY/ZZ.csv'
csv_obj = client.get_object(Bucket=bucket_name, Key=object_key)
body = csv_obj['Body']
csv_string = body.read().decode('utf-8-sig')
df = pd.read_csv(StringIO(csv_string))
print(df)
Right now, I am getting the entire CSV. Below is the output
0 63a2a854-a136-4bb1-a89b-a4e638b2be14 8128639b-a163-4e8e-b1f8-22e3dcd2b655 ... 123 63a2a854-a136-4bb1-a89b-a4e638b2be14
1 63a2a854-a136-4bb1-a89b-a4e638b2be14 8d6bdc73-f908-45d8-8d8a-c3ac0bee3b29 ... 123 63a2a854-a136-4bb1-a89b-a4e638b2be14
2 63a2a854-a136-4bb1-a89b-a4e638b2be14 1312e6f6-4c5f-4fa5-babd-93a3c0d3b502 ... 234 63a2a854-a136-4bb1-a89b-a4e638b2be14
3 63a2a854-a136-4bb1-a89b-a4e638b2be14 bfec5ccc-4449-401d-9898-9c523b1e1230 ... 456 63a2a854-a136-4bb1-a89b-a4e638b2be14
4 63a2a854-a136-4bb1-a89b-a4e638b2be14 522a72f0-2746-417c-9a59-fae4fb1e07d7 ... 567 63a2a854-a136-4bb1-a89b-a4e638b2be14
[5 rows x 9 columns]
Right now my CSV does not have any headers, so the only option I have is to grab the columns I need by their number, but I am not sure how to do that. Can anyone please assist?
Option 1:
If you have already read the CSV and want to drop the other columns mid-calculation, use the indices of the columns you want to keep inside df.iloc.
Example:
>>> df #sample dataframe I want to get the first 2 columns only
Artist Count Test
0 The Beatles 4 1
1 Some Artist 2 1
2 Some Artist 2 1
3 The Beatles 4 1
4 The Beatles 4 1
5 The Beatles 4 1
>>> df3 = df.iloc[:,[0,1]]
>>> df3
Artist Count
0 The Beatles 4
1 Some Artist 2
2 Some Artist 2
3 The Beatles 4
4 The Beatles 4
5 The Beatles 4
Option 2
While reading the file itself, specify which columns to use via the usecols parameter of read_csv().
df = pd.read_csv(StringIO(csv_string), usecols = [place column index here])
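Since your CSV has no header row, you may also want to pass header=None so the first data row is not consumed as column names; the column positions below are hypothetical:
df = pd.read_csv(StringIO(csv_string), header=None, usecols=[1, 5])
df.columns = ['first_needed_col', 'second_needed_col']  # optional readable names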
Use the read_csv method from the pandas library:
import pandas as pd
data = pd.read_csv('file.csv', usecols=[2, 4])
print(data.head())
The usecols parameter accepts a list of column names or column indices.
Since you are already utilizing the Pandas library, note that the header= argument of read_csv selects which rows to use as the header, not which columns to keep. From the docs: ... The header can be a list of integers that specify row locations for a multi-index on the columns e.g. [0,1,3]. Intervening rows that are not specified will be skipped (e.g. 2 in this example is skipped)...
To pull specific columns by position, pass usecols instead:
# will pull the columns at positions 0, 2 and 4
df = pd.read_csv(StringIO(csv_string), usecols=[0, 2, 4])
In [15]: import pandas as pd
In [16]: d1 = {"col1" : "value11", "col2": "value21", "col3": "value31"}
In [17]: d2 = {"col1" : "value12", "col2": "value22", "col3": "value32"}
In [18]: d3 = {"col1" : "value13", "col2": "value23", "col3": "value33"}
In [19]: df = df.append(d1, ignore_index=True, verify_integrity=True, sort=False)
In [20]: df = df.append(d2, ignore_index=True, verify_integrity=True, sort=False)
In [21]: df = df.append(d3, ignore_index=True, verify_integrity=True, sort=False)
In [22]: df
Out[22]:
col1 col2 col3
0 value11 value21 value31
1 value12 value22 value32
2 value13 value23 value33
3 value11 value21 value31
4 value12 value22 value32
5 value13 value23 value33
In [23]: # Selecting only col1 and col3
In [24]: df_new = df[["col1", "col3"]]
In [25]: df_new
Out[25]:
col1 col3
0 value11 value31
1 value12 value32
2 value13 value33
3 value11 value31
4 value12 value32
5 value13 value33
In [26]:
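As a side note, DataFrame.append used above is deprecated (and removed in pandas 2.0), so on recent versions the same result can be built with pd.concat instead; a sketch:
import pandas as pd

d1 = {"col1": "value11", "col2": "value21", "col3": "value31"}
d2 = {"col1": "value12", "col2": "value22", "col3": "value32"}
d3 = {"col1": "value13", "col2": "value23", "col3": "value33"}

# build the frame in one go instead of appending row by row
df = pd.concat([pd.DataFrame([d]) for d in (d1, d2, d3)], ignore_index=True)
df_new = df[["col1", "col3"]]  # keep only col1 and col3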

How to combine summary statistics from 100s of *csv files into one *csv with pandas?

I have several hundred *.csv files which, when imported into a pandas data frame, look as follows:
import pandas as pd
df = pd.read_csv("filename1.csv")
df
column1 column2 column3 column4
0 10 A 1 ID1
1 15 A 1 ID1
2 19 B 1 ID1
3 5071 B 0 ID1
4 5891 B 0 ID1
5 3210 B 0 ID1
6 12 B 2 ID1
7 13 C 2 ID1
8 20 C 0 ID1
9 5 C 3 ID1
10 9 C 3 ID1
Each *.csv file has a unique ID in column4 (whereby each row has the same element).
I would like to create a new csv file, whereby each filename is a row, keeping the ID/value from column4 and the max values of column1 and column3. What is the best pandas way to do this?
ID1 5891 3
....
My idea would be:
import glob

import numpy as np
import pandas as pd

files = glob.glob("*.csv")  # within the correct subdirectory
newdf1 = pd.DataFrame()
for file in files:
    df = pd.read_csv(file)
    df["ID"] = df.column4.unique()
    df["max_column1"] = df.column1.max()
    df["max_column3"] = df.column3.max()
    newdf1 = newdf1.append(df, ignore_index=True)
newdf1.to_csv("totalfile.csv")
However, (1) I don't know if this is efficient and (2) I don't know if the dimensions of the final csv are correct. Also, how would one deal with a *.csv that is missing column1 or column3? That is, it should "pass" these values.
What is the correct way to do this?
I think you can loop over the files, get the first value of column4 with iat and the column maxima with max, and append them to a list.
Then use the DataFrame constructor and write to a file.
files = glob.glob("*.csv")  # within the correct subdirectory
L = []
for file in files:
    df = pd.read_csv(file)
    u = df.column4.iat[0]
    m1 = df.column1.max()
    m2 = df.column3.max()
    L.append({'ID': u, 'max_column1': m1, 'max_column3': m2})

newdf1 = pd.DataFrame(L)
newdf1.to_csv("totalfile.csv")
EDIT:
L = []
for file in files:
    print(file)
    df = pd.read_csv(file)
    #print(df)
    m1, m2 = np.nan, np.nan
    if df.columns.str.contains('column1').any():
        m1 = df.column1.max()
    if df.columns.str.contains('column3').any():
        m2 = df.column3.max()
    u = df.column4.iat[0]
    L.append({'ID': u, 'max_column1': m1, 'max_column3': m2})

newdf1 = pd.DataFrame(L)
Repeated appending to a pandas DataFrame is highly inefficient as it copies the DataFrame.
Instead you could write the max values found to the resultant file directly.
files = glob.glob("*.csv")
with open("totalfile.csv", "w") as fout:
    for f in files:
        df = pd.read_csv(f)
        result = df.loc[:, ['column4', 'column1', 'column3']].max()\
                   .fillna('pass').to_dict()
        fout.write("{column4},{column1},{column3}\n".format(**result))
df.loc[:, ['column4', 'column1', 'column3']] would return NaN-filled columns for any missing columns. This would raise an exception only when all three columns are missing.
fillna('pass') will substitute the missing values.
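One caveat: on recent pandas versions, selecting a list of labels that includes missing columns with .loc raises a KeyError instead of returning NaN-filled columns. reindex restores that behaviour explicitly; a sketch of the result computation inside the loop, under that assumption:
# reindex guarantees NaN columns for any of the three that are missing
result = (df.reindex(columns=['column4', 'column1', 'column3'])
            .max()
            .fillna('pass')
            .to_dict())
fout.write("{column4},{column1},{column3}\n".format(**result))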
