I have a .csv file that is split in sections, each starting with < string > on a row of its own as in this example. This is followed by a set of columns and their respective rows of values. Columns are not consistent between sections.
< section1 >
col1 col2 col3
val1 val2 val3
< section2 >
col3 col4 col5
val4 val5 val6
val7 val8 val9
...etc. Is there a way in which I can, either when the file's in .txt or .csv, import each section either:
1) into separate dataframes?
2) into the same dataframe, but something like df[section][col]?
Many thanks!
Depending on the size of your csv, you could read in the entire file into Pandas and split the dataframe into multiple dataframes via a list comprehension.
data = '''<Network>;;;;;;;;;;;;;;;;;;;;;
Property;Value;;;;;;;;;;;;;;;;;;;;
Title;;;;;;;;;;;;;;;;;;;;;
Version;6.4;;;;;;;;;;;;;;;;;;;;
;;;;;;;;;;;;;;;;;;;;;
<Sites>;;;;;;;;;;;;;;;;;;;;;
Name;LocationCode;Longitude;Latitude;;;;;;;;;;...'''
import pandas as pd
from io import StringIO
from more_itertools import windowed

df = pd.read_csv(StringIO(data), header=None)
# create a list of dataframe names (the headers of each df)
df_names = df[0].str.extract(r'(<[a-zA-Z]+>)')[0].str.strip('<>').dropna().tolist()
# find the indices for the headers
regions = df.loc[df[0].str.contains(r'<[a-zA-Z]+')].index.tolist()
# close the last window one past the final row
regions.append(len(df))
# create windows for each 'sub' dataframe
regions_window = list(windowed(regions, 2))
# the function helps with some cleanup during the dataframe extraction
def some_cleanup(df):
    df.columns = df.iloc[0].str.extract(r'(<[a-zA-Z]+>)')[0].str.strip('<>')
    df = df.iloc[1:]
    return df
# extract the dataframes (iloc excludes the end, so the next header row is left out)
M = [df.iloc[start:end].pipe(some_cleanup) for start, end in regions_window]
# create a dict with the keys as the dataframe names
dataframe_dict = dict(zip(df_names, M))
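For the second option in the question (something like df[section][col]), one possibility is to stack the per-section frames with pd.concat, which uses the dict keys as an outer index level. This is only a minimal sketch: the two small frames below are made-up stand-ins for a dict shaped like dataframe_dict, and the names frames and combined are hypothetical.

```python
import pandas as pd

# Hypothetical stand-ins for the per-section dataframes
frames = {
    "section1": pd.DataFrame({"col1": ["val1"], "col2": ["val2"], "col3": ["val3"]}),
    "section2": pd.DataFrame({"col3": ["val4", "val7"],
                              "col4": ["val5", "val8"],
                              "col5": ["val6", "val9"]}),
}

# Passing a dict makes its keys the outer level of a row MultiIndex
combined = pd.concat(frames, names=["section", "row"])

# Pick one section, then one column; columns a section lacks are NaN there
section2_col4 = combined.loc["section2"]["col4"]
print(section2_col4.tolist())
```

Because the columns differ between sections, the combined frame holds the union of all columns, so keeping the dict of separate frames may be the tidier choice.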
I think you can take a simple approach and read the txt file like this:
with open("dummy.txt") as f:
    lines = f.readlines()
Now just get the location of each section:
sections = [i for i, line in enumerate(lines) if "<" in line]
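Then you can use sections to read the data in between into pandas dataframes like:
import pandas as pd
for i, start in enumerate(sections):
    # the last section runs to the end of the file
    end = sections[i + 1] if i + 1 < len(sections) else len(lines)
    # the row right after the section marker holds the column names
    header = lines[start + 1].split()
    rows = [line.split() for line in lines[start + 2:end] if line.strip()]
    df = pd.DataFrame(rows, columns=header)
    print(df.head())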
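Putting those steps together, here is a rough sketch that collects every section into a dict of dataframes keyed by section name. It assumes the layout from the question (a `< name >` marker line, then a whitespace-separated header row, then data rows); read_sections is a hypothetical helper name, not something from pandas.

```python
import pandas as pd

def read_sections(lines):
    """Split lines into {section name: DataFrame} (hypothetical helper)."""
    blocks = {}      # section name -> list of split rows, header row first
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("<") and line.endswith(">"):
            # a marker line such as "< section1 >" opens a new section
            current = line.strip("<> ")
            blocks[current] = []
        elif current is not None:
            blocks[current].append(line.split())
    return {name: pd.DataFrame(rows[1:], columns=rows[0])
            for name, rows in blocks.items() if rows}

sample = """< section1 >
col1 col2 col3
val1 val2 val3
< section2 >
col3 col4 col5
val4 val5 val6
val7 val8 val9
"""
frames = read_sections(sample.splitlines())
print(frames["section2"]["col4"].tolist())
```

For a real file, passing the open file object (for example `read_sections(f)` inside a `with open(...)` block) should work the same way, since the function only iterates over lines.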
There are some great answers here already but I'd recommend a Unix tool! It is shorter and will scale to very large files that don't fit into Pandas.
Assuming your file is called foo.csv:
awk '/< section/{x=i++"foo_mini";next}{print > x;}' foo.csv
Creates as many (numbered) {n}foo_mini files as you have sections. (It looks for the pattern < section, and then starts a new file from the following line.)
Then for completeness' sake, add the csv extension:
for file in *foo_mini; do mv "$file" "${file/foo_mini/foo_mini.csv}"; done
You thus have:
0foo_mini.csv
1foo_mini.csv
etc...
It's then a cinch to read them in with Pandas as separate dataframes, and concat them if you like.
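As a sketch of that last step, the split files could be picked up with glob and read one by one. The whitespace separator is an assumption based on the example in the question, load_parts is a hypothetical helper, and the two files written here only stand in for the awk output so the snippet runs on its own.

```python
import glob
import os
import tempfile

import pandas as pd

def load_parts(directory, pattern="*foo_mini.csv", sep=r"\s+"):
    """Read every split file into its own DataFrame, keyed by file name."""
    paths = sorted(glob.glob(os.path.join(directory, pattern)))
    return {os.path.basename(p): pd.read_csv(p, sep=sep) for p in paths}

# Two tiny files standing in for 0foo_mini.csv and 1foo_mini.csv
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "0foo_mini.csv"), "w") as f:
        f.write("col1 col2 col3\nval1 val2 val3\n")
    with open(os.path.join(tmp, "1foo_mini.csv"), "w") as f:
        f.write("col3 col4 col5\nval4 val5 val6\nval7 val8 val9\n")
    parts = load_parts(tmp)

# Optional: stack them, keeping the file name as an outer index level
combined = pd.concat(parts, names=["file", "row"])
print(combined)
```

Note that sorted() orders the names as text, so 10foo_mini.csv would come before 2foo_mini.csv; that only matters if the order of the sections is important.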
I'd do something like this:
import re
import pandas as pd
new_section = False
header_read = False
data_for_frame = list()
for row in data.splitlines():
    # a section marker starts a new section
    if row.startswith('< '):
        new_section = True
        continue
    # an empty line closes the current section
    if re.match(r'^\s*$', row):
        new_section = False
        header_read = False
        # df is overwritten for every section: store or rename it here
        df = pd.DataFrame(data_for_frame, columns=columns)
        data_for_frame = list()
        continue
    if new_section:
        # the first row after the marker holds the column names
        if not header_read:
            columns = row.split(' ')
            header_read = True
            continue
        if header_read:
            data_for_frame.append(row.split(' '))
            continue
The only important thing might be that each section, including the last one in the CSV file, ends with an empty line. You also have to take care of the dataframe naming.
The data.splitlines() just came from my own short test; you have to replace it with with open('myfile', 'r') as f: and so on.
I am trying to load a table from a txt file, but I want to start loading from a certain word.
In this case, this is the file, and I want to start from the numbers beneath the sentence >>>>Begin....
I know about the skiprows argument, but not all tables start at the same line.
Thanks
Maybe this is not a super efficient way to do it, but I tried to filter the necessary data and append it to a df using the script below:
import re
import os
import pandas as pd
def foo(file_name):
    # collect the parsed rows here and build the df at the end
    # (DataFrame.append was removed in pandas 2.0)
    rows = []
    pat = r'>+[a-zA-Z ]*<+'
    pat2 = r'[-0-9.]*'
    start_save_to_df = False
    # set path
    with open(os.path.join(os.getcwd(), 'src', file_name)) as f:
        for row in f.readlines():
            if start_save_to_df:
                val1, val2 = [float(val) for val in re.findall(pat2, row) if val]
                # append data
                rows.append({'a': val1, 'b': val2})
            # start saving from the line after the marker
            if re.search(pat, row):
                start_save_to_df = True
    return pd.DataFrame(rows, columns=list('ab'))
I hope it helps.
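Another option is to find the marker line first and hand only the lines after it to pd.read_csv. This is a sketch under assumptions: the marker text, the two-column whitespace layout and the sample contents are made up here, and read_after_marker is a hypothetical helper name.

```python
from io import StringIO

import pandas as pd

def read_after_marker(text, marker, **read_csv_kwargs):
    """Parse only the lines that follow the first line containing marker."""
    lines = text.splitlines()
    for number, line in enumerate(lines):
        if marker in line:
            body = "\n".join(lines[number + 1:])
            return pd.read_csv(StringIO(body), **read_csv_kwargs)
    raise ValueError(f"marker {marker!r} not found")

# Made-up file contents; the real file layout is not shown in the question
sample = """some preamble
more notes
>>>>Begin<<<<
1.0 2.5
3.0 -4.5
"""
df = read_after_marker(sample, ">>>>Begin", sep=r"\s+", header=None, names=["a", "b"])
print(df)
```

Because the marker is searched for in every file, it does not matter that the tables start at different lines.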
We have an if/else block with a loop whose goal is to split a dataframe into several dataframes. The result will vary, so we do not know in advance how many dataframes we will get out of one dataframe. We want to save each of those dataframes as text (.txt):
txtDf = open('D:/My_directory/df0.txt', 'w')
txtDf.write(df0)
txtDf.close()
txtDf = open('D:/My_directory/df1.txt', 'w')
txtDf.write(df0)
txtDf.close()
txtDf = open('D:/My_directory/df2.txt', 'w')
txtDf.write(df0)
txtDf.close()
And so on ....
But we want to save those dataframes automatically, so that we don't need to write the code above 100 times for 100 split dataframes.
This is the example our dataframe df:
column_df
237814
1249823
89176812
89634
976234
98634
and we would like to split the dataframe df to several df0, df1, df2 (notes: each column will be in their own dataframe, not in one dataframe):
column_df0 column_df1 column_df2
237814 89176812 976234
1249823 89634 98634
We tried this code:
import copy
import numpy as np
df= pd.DataFrame(df)
len(df)
if len(df) > 10:
    print('EXCEEEEEEEEEEEEEEEEEEEEDDD!!!')
    sys.exit()
elif len(df) > 2:
    df_dict = {}
    x=0
    y=2
    for df_letter in ['A','B','C','D','E','F']:
        df_name = f'df_{df_letter}'
        df_dict[df_name] = copy.deepcopy(df_letter)
        df_dict[df_name] = pd.DataFrame(df[x:y]).to_string(header=False, index=False, index_names=False).split('\n ')
        df_dict[df_name] = [','.join(ele.split()) for ele in df_dict[df_name]]
        x += 2
        y += 2
    df_name
else:
    df
for df_ in df_dict:
    print(df_)
    print(f'length: {len(df_dict[df_])}')
    txtDf = open('D:/My_directory/{df_dict[df_]}.txt', 'w')
    txtDf.write(df)
    txtDf.close()
The problem with this code is that we cannot write several .txt files automatically, everything else works just fine. Can anybody figure it out?
If each value is a list, then you can iterate through the dict and save each element as a string:
import os
for key, value in df_dict.items():
    with open(f'D:/My_directory/{key}.txt', "w") as file:
        file.write('\n'.join(str(v) for v in value))
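If the number of pieces is not known in advance, the splitting and the writing can also be done in one loop. The sketch below assumes chunks of two rows, as in the example above; write_chunks is a hypothetical helper, and a temporary directory stands in for 'D:/My_directory'.

```python
import os
import tempfile

import pandas as pd

def write_chunks(df, out_dir, chunk_size=2, prefix="df"):
    """Write consecutive row chunks of df to out_dir/df0.txt, df1.txt, ..."""
    paths = []
    for n, start in enumerate(range(0, len(df), chunk_size)):
        chunk = df.iloc[start:start + chunk_size]
        path = os.path.join(out_dir, f"{prefix}{n}.txt")
        with open(path, "w") as handle:
            # to_string returns plain text, which is what write() needs
            handle.write(chunk.to_string(header=False, index=False))
        paths.append(path)
    return paths

df = pd.DataFrame({"column_df": [237814, 1249823, 89176812, 89634, 976234, 98634]})
out_dir = tempfile.mkdtemp()   # stand-in for 'D:/My_directory'
paths = write_chunks(df, out_dir)
print([os.path.basename(p) for p in paths])
```

The loop produces as many files as there are chunks, so the same code covers 3 or 100 pieces.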
I have a big excel sheet with information about different companies altogether in a single cell for each company and my goal is to separate this into different columns following patterns to scrape the info from the first column. The original data looks like this:
My goal is to achieve a dataframe like this:
I have created the following code to use the patterns Mr., Affiliation:, E-mail:, and Mobile because they are repeated in every single row the same way. However, I don't know how to use the findall() function to scrape all the info I want from each row of the desired column.
import openpyxl
import re
import sys
import pandas as pd
reload(sys)
sys.setdefaultencoding('utf8')
wb = openpyxl.load_workbook('/Users/ap/info1.xlsx')
ws = wb.get_sheet_by_name('Companies')
w={'Name': [],'Affiliation': [], 'Email':[]}
for row in ws.iter_rows('C{}:C{}'.format(ws.min_row,ws.max_row)):
    for cells in row:
        # text of the current cell
        aa = cells.value
        a=re.findall(r'Mr\.(.*?)Affiliation:',aa, re.DOTALL)
        a1="".join(a).replace('\n',' ')
        b=re.findall(r'Affiliation:(.*?)E-mail',aa,re.DOTALL)
        b1="".join(b).replace('\n',' ')
        c=re.findall(r'E-mail(.*?)Mobile',aa,re.DOTALL)
        c1="".join(c).replace('\n',' ')
        w['Name'].append(a1)
        w['Affiliation'].append(b1)
        w['Email'].append(c1)
        print cells.value
df=pd.DataFrame(data=w)
df.to_excel(r'/Users/ap/info2.xlsx')
I would go with this, which just replaces the labels ('Affiliation:', 'E-mail:', 'Mobile:') with a delimiter and then splits and assigns each part to the right column:
# object-dtype placeholders for the new columns
df['Name'] = None
df['Affiliation'] = None
df['Email'] = None
df['Mobile'] = None
for i in range(0, len(df)):
    full_value = df['Companies'].loc[i]
    full_value = full_value.replace('Affiliation:', ';').replace('E-mail:', ';').replace('Mobile:', ';')
    full_value = full_value.split(';')
    # assign through df.loc[row, column] so the original df is modified
    df.loc[i, 'Name'] = full_value[0]
    df.loc[i, 'Affiliation'] = full_value[1]
    df.loc[i, 'Email'] = full_value[2]
    df.loc[i, 'Mobile'] = full_value[3]
del df['Companies']
print(df)
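The same split can also be attempted without a Python loop, using Series.str.extract with named groups. This is only a sketch: the two sample cells are invented, and the pattern assumes every cell contains the labels Mr., Affiliation:, E-mail: and Mobile in that order.

```python
import re

import pandas as pd

# Invented sample cells; the real sheet contents are not shown in the question
df = pd.DataFrame({"Companies": [
    "Mr. John Doe\nAffiliation: Acme Ltd\nE-mail: john@example.com\nMobile: 555 0100",
    "Mr. Sam Roe\nAffiliation: Widgets Inc\nE-mail: sam@example.com\nMobile: 555 0101",
]})

# One named group per target column; DOTALL lets '.' span line breaks
pattern = (
    r"Mr\.\s*(?P<Name>.*?)\s*"
    r"Affiliation:\s*(?P<Affiliation>.*?)\s*"
    r"E-mail:\s*(?P<Email>.*?)\s*"
    r"Mobile:?\s*(?P<Mobile>.*)"
)
parts = df["Companies"].str.extract(pattern, flags=re.DOTALL)

# Trim stray whitespace left at the edges of each captured field
parts = parts.apply(lambda col: col.str.strip())
print(parts)
```

Rows that do not match the pattern come back as NaN in every column, which makes irregular cells easy to spot.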
I have the following csv file:
csv file
There are about 6-8 rows at the top of the file before the headers. I know how to make a new dataframe in Pandas and filter the data:
df = pd.read_csv('payments.csv')
df = df[df["type"] == "Order"]
print(df.groupby('sku').size())
df = df[df["marketplace"] == "amazon.com"]
print(df.groupby('sku').size())
df = df[df["promotional rebates"] > ((df["product sales"] + df["shipping credits"])*-.25)]
print(df.groupby('sku').size())
df.to_csv("out.csv")
My issue is with the Headers. I need to
1. look for the row that has date/time & another field.
That way I do not have to change my code if the file keeps changing the row count before the headers.
2. make a new DF excluding those rows.
What is the best approach to make sure the code does not break when the file changes, as long as the header row exists and has a few matching fields? Open to any suggestions.
Considering a CSV file like this:
random line content
another random line
yet another one
datetime, settelment id, type
dd, dd, dd
You can use the following to compute the header's line number:
#load the first 20 rows of the csv file as a one column dataframe
#to look for the header
df = pd.read_csv("csv_file.csv", sep="|", header=None, nrows=20)
# use a regular expression to check which row holds the header
# the following will generate an array of booleans
# with True if the row contains the regex "datetime.+settelment id.+type"
indices = df.iloc[:,0].str.contains("datetime.+settelment id.+type")
# get the row index of the header
header_index = df[indices].index.values[0]
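and read the csv file starting from the header's index:
# skip everything before the header row, so that it becomes the header.
# read_csv dropped blank lines when header_index was computed, so add the
# number of blank lines that precede the header, if any (the example below
# starts with one blank line, hence the +1 there)
df = pd.read_csv("csv_file.csv", skiprows=header_index)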
Reproducible example:
import pandas as pd
from io import StringIO
st = """
random line content
another random line
yet another one
datetime, settelment id, type
dd, dd, dd
"""
df = pd.read_csv(StringIO(st), sep="|", header=None, nrows=20)
indices = df.iloc[:,0].str.contains("datetime.+settelment id.+type")
header_index = df[indices].index.values[0]
df = pd.read_csv(StringIO(st), skiprows=header_index+1)
print(df)
print("columns")
print(df.columns)
print("shape")
print(df.shape)
Output:
datetime settelment id type
0 dd dd dd
columns
Index(['datetime', ' settelment id', ' type'], dtype='object')
shape
(1, 3)
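A variation on the same idea is to scan the raw lines yourself and pass the line number that was found to skiprows. This is a sketch under assumptions: find_header_line is a hypothetical helper, the field names and sample rows are invented, and the sample has no blank lines before the header.

```python
from io import StringIO

import pandas as pd

def find_header_line(handle, required_fields, max_lines=50):
    """Return the 0-based number of the first line that names all fields."""
    for number, line in enumerate(handle):
        if number >= max_lines:
            break
        if all(field in line for field in required_fields):
            return number
    raise ValueError("header row not found")

# Invented sample; a real file would be opened with open(path) instead
text = """random line content
another random line
yet another one
date/time,settlement id,type
dd,dd,Order
"""
header_line = find_header_line(StringIO(text), ["date/time", "type"])

# Skip everything above the header so that it becomes the column row
df = pd.read_csv(StringIO(text), skiprows=header_line)
print(df.columns.tolist())
```

As long as the header row keeps a couple of recognisable field names, the number of rows above it can change without touching the code.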
I’m new to coding, and trying to extract a subset of data from a large file.
File_1 contains the data in two columns: ID and Values.
File_2 contains a large list of IDs, some of which may be present in File_1 while others will not be present.
If an ID from File_2 is present in File_1, I would like to extract those values and write the ID and value to a new file, but I’m not sure how to do this. Here is an example of the files:
File_1: data.csv
ID Values
HOT224_1_0025m_c100047_1 16
HOT224_1_0025m_c10004_1 3
HOT224_1_0025m_c100061_1 1
HOT224_1_0025m_c10010_2 1
HOT224_1_0025m_c10020_1 1
File_2: ID.xlsx
IDs
HOT224_1_0025m_c100047_1
HOT224_1_0025m_c100061_1
HOT225_1_0025m_c100547_1
HOT225_1_0025m_c100561_1
I tried the following:
import pandas as pd
data_file = pd.read_csv('data.csv', index_col = 0)
ID_file = pd.read_excel('ID.xlsx')
values_from_ID = data_file.loc[['ID_file']]
The following error occurs:
KeyError: "None of [['ID_file']] are in the [index]"
Not sure if I am reading in the excel file correctly.
I also do not know how to write the extracted data to a new file once I get the code to do it.
Thanks for your help.
With pandas:
import pandas as pd
data_file = pd.read_csv('data.csv', index_col=0, sep=r'\s+')
ID_file = pd.read_excel('ID.xlsx', index_col=0)
res = data_file.reindex(ID_file.index).dropna()
res.to_csv('result.csv')
Content of result.csv:
IDs,Values
HOT224_1_0025m_c100047_1,16.0
HOT224_1_0025m_c100061_1,1.0
In steps:
You need to read your csv as whitespace-delimited (sep=r'\s+' replaces the deprecated delim_whitespace=True):
data_file = pd.read_csv('data.csv', index_col=0, sep=r'\s+')
It looks like this:
>>> data_file
Values
ID
HOT224_1_0025m_c100047_1 16
HOT224_1_0025m_c10004_1 3
HOT224_1_0025m_c100061_1 1
HOT224_1_0025m_c10010_2 1
HOT224_1_0025m_c10020_1 1
Now, read your Excel file, using the ids as index:
ID_file = pd.read_excel('ID.xlsx', index_col=0)
and use its index with reindex to get the matching entries from your first dataframe (recent pandas versions raise a KeyError if loc is given labels that are missing). IDs that are not in the first dataframe come back as NaN rows; drop them with dropna():
res = data_file.reindex(ID_file.index).dropna()
Finally, write to the result csv:
res.to_csv('result.csv')
You can do it using a simple dictionary in Python. You can make a dictionary from File 1 and read the IDs from File 2. The IDs from File 2 can be checked against the dictionary, and only the matching ones written to your output file. Something like this could work:
with open('data.csv','r') as f:
    lines = f.readlines()
#Skip the CSV Header
lines = lines[1:]
# map each ID to its value, ignoring empty lines
table = {l.split()[0]:l.split()[1] for l in lines if len(l.strip()) != 0}
with open('id.csv','r') as f:
    lines = f.readlines()
#Skip the CSV Header
lines = lines[1:]
# keep only the IDs that also appear in the data file
matchedIDs = [(l.strip(),table[l.strip()]) for l in lines if l.strip() in table]
Now you will have your matched IDs and their values in a list of tuples called matchedIDs. You can write them in any format you like in a file.
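For the writing step, the standard csv module is one straightforward option. A small sketch, with a hard-coded matchedIDs list and an in-memory buffer standing in for the real output file; write_matches is a hypothetical helper name.

```python
import csv
import io

# Stand-in for the list of (ID, value) tuples built above
matchedIDs = [("HOT224_1_0025m_c100047_1", "16"),
              ("HOT224_1_0025m_c100061_1", "1")]

def write_matches(handle, matches):
    """Write a header row followed by one row per (ID, value) pair."""
    writer = csv.writer(handle)
    writer.writerow(["ID", "Values"])
    writer.writerows(matches)

# For a real file use: open('result.csv', 'w', newline='')
buffer = io.StringIO()
write_matches(buffer, matchedIDs)
print(buffer.getvalue())
```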
I'm also new to Python programming, so the code below might not be the most efficient. I assumed the goal is to find the IDs in data.csv that are also in id.csv; there might be some IDs in data.csv that are not in id.csv and vice versa.
import pandas as pd
data = pd.read_csv('data.csv')
id2 = pd.read_csv('id.csv')
data_ids = data['ID']
id2_ids = id2['IDs']
# IDs from data.csv
d=[]
for row in data_ids:
    d.append(row)
# IDs from id.csv
f=[]
for row in id2_ids:
    f.append(row)
# IDs present in both files
g=[]
for i in d:
    if i in f:
        g.append(i)
data = pd.read_csv('data.csv',index_col='ID')
new_data = data.loc[g,:]
new_data.to_csv('new_data.csv')
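The three loops above can usually be collapsed into a single boolean filter with Series.isin. A minimal sketch, with the two files replaced by small in-memory frames built from the example data in the question:

```python
import pandas as pd

# In-memory stand-ins for data.csv and the ID list
data = pd.DataFrame({
    "ID": ["HOT224_1_0025m_c100047_1", "HOT224_1_0025m_c10004_1",
           "HOT224_1_0025m_c100061_1", "HOT224_1_0025m_c10010_2",
           "HOT224_1_0025m_c10020_1"],
    "Values": [16, 3, 1, 1, 1],
})
ids = pd.DataFrame({"IDs": ["HOT224_1_0025m_c100047_1", "HOT224_1_0025m_c100061_1",
                            "HOT225_1_0025m_c100547_1", "HOT225_1_0025m_c100561_1"]})

# Keep only the rows whose ID also appears in the ID list
matched = data[data["ID"].isin(ids["IDs"])]
print(matched)
```

The result keeps the original integer values and can be written out with matched.to_csv('new_data.csv', index=False).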
This is the code I ended up using. It worked perfectly. Thanks to everyone for their responses.
import pandas as pd
data_file = pd.read_csv('data.csv', index_col=0)
ID_file = pd.read_excel('ID.xlsx', index_col=0)
res = data_file.loc[ID_file.index].dropna()
res.to_csv('result.csv')