Question:
How can I split a list into two sublists, splitting each element at the tab it contains?
Context:
I want to read a .txt file delimited by tabs into a Pandas DataFrame. The files look something like:
Column1 \t 123
Column2 \t
Column3 \t text
Meaning that each line has one column followed by one tab and then one value of the column (sometimes no value).
My idea was to read the file and save each line as an element of a list, then split the list into two keeping the first part before the tab as one list and the second part after the tab as another. Then build my dataframe from there.
for file in txt_files:  # iterate over all files
    f = open(file)  # open each file individually
    lines = f.readlines()  # read each line as an element into a list
    f.close()
    # make sublists columns and values
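A minimal sketch of that splitting step, assuming each line contains exactly one tab (the 'column'/'value' labels in the dataframe are illustrative):
import pandas as pd

columns, values = [], []
for line in lines:
    # Split each line at the first tab into a column name and a value
    col, _, val = line.rstrip('\n').partition('\t')
    columns.append(col)
    values.append(val)

# Build a two-column dataframe from the parallel lists
df = pd.DataFrame({'column': columns, 'value': values})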
You can read your files into a dataframe like this:
import pandas as pd

# Empty list to store dataframe rows
df_rows = []

# Read all text files
for tf in text_files:
    # For each file
    with open(tf) as f:
        # Empty dictionary to store column names and values
        df_dict = {}
        # For each line
        for line in f:
            # Strip the trailing newline, then split by tab
            k, v = line.rstrip('\n').split('\t')
            # Column name as key, value as value
            df_dict[k] = v
        # Add the dictionary to the list
        df_rows.append(df_dict)

# Read the list of dictionaries as a dataframe
df = pd.DataFrame(df_rows)

# Preview dataframe
df.head()
If I understand correctly, you can just transpose the dataframe that read_csv will give you with delimiter='\t'.
Demo:
>>> from io import StringIO
>>> import pandas as pd
>>>
>>> file = StringIO('''Column1\t123
... Column2\t
... Column3\ttext''')
>>>
>>> df = pd.read_csv(file, delimiter='\t', index_col=0, header=None).T
>>> df
0 Column1 Column2 Column3
1     123     NaN    text
(If your delimiter is really ' \t ' then use delimiter=' \t ' and engine='python').
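For example, a call for space-padded tabs might look like this (engine='python' is needed because the separator is longer than one character):
df = pd.read_csv(file, delimiter=' \t ', engine='python', index_col=0, header=None).T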
Related
I want to read a text file. The file is like this:
17430147 17277121 17767569 17352501 17567841 17650342 17572001
I want the result:
17430147
17277121
17767569
17352501
17567841
17650342
17572001
So, I tried some code:
data = pd.read_csv('train.txt', header=None, delimiter=r"\s+")
or
data = pd.read_csv('train.txt', header=None, delim_whitespace=True)
From that code, I get an error like this:
ParserError: Too many columns specified: expected 75262 and found 154
Then I tried this code:
file = open("train.txt", "r")
data = []
for i in file:
    i = i.replace("\n", "")
    data.append(i.split(" "))
But I think there are missing values in the txt file:
'2847',
'2848',
'2849',
'1947',
'2850',
'2851',
'2729',
''],
['2852',
'2853',
'2036',
Thank you!
The first step would be to read the text file as a string of values.
with open('train.txt', 'r') as f:
    lines = f.readlines()
list_of_values = lines[0].split(' ')
Here, list_of_values looks like:
['17430147',
'17277121',
'17767569',
'17352501',
'17567841',
'17650342',
'17572001']
Now, to create a DataFrame out of this list, simply execute:
import pandas as pd
pd.DataFrame(list_of_values)
This will give a pandas DataFrame with a single column with values read from the text file.
If you only need the distinct values that exist in the text file, you can work with the list list_of_values directly.
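For example, a set will keep only the distinct values (note it does not preserve order):
unique_values = set(list_of_values)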
You can use the .T attribute to transpose your dataframe:
data = pd.read_csv("train.txt", delim_whitespace=True, header=None).T
How do I create a dataframe from a string that looks like this (part of the string)?
,file_05,,\r\nx data,y data\r\n-970.0,-34.12164,\r\n-959.0,-32.37526,\r\n-949.0,-30.360199,\r\n-938.0,-28.74816,\r\n-929.0,-27.53912,\r\n-920.0,-25.92707,\r\n-911.0,-24.31503,\r\n-900.0,-23.64334,\r\n-891.0,-22.29997,
I am trying to make a dataframe that looks like this:
In the code below s is the string:
import pandas as pd
from io import StringIO
df = pd.read_csv(StringIO(s)).dropna(axis=1)
df.rename(columns={df.columns[0]: ""}, inplace=True)
By the way, if the string comes from a csv file then it is simpler to read the file directly using pd.read_csv.
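For example, assuming the string had been saved to a file with the hypothetical name file_05.csv:
df = pd.read_csv('file_05.csv').dropna(axis=1)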
Edit: This code will create a multiindex of columns:
import pandas as pd
from io import StringIO
df = pd.read_csv(StringIO(s), header=None).dropna(how="all", axis=1).T
df[0] = df.loc[1, 0]
df = df.set_index([0, 1]).T
Looks like you want a multi-level dataframe from the string. Here's how I would do it.
Step 1: Split the string by '\r\n'. Then split each resulting value by ','.
Step 2: The above step creates a list of lists. Element #0 has 4 items and element #1 has 2 items; the rest have 3 items each and are the actual data.
Step 3: Convert the data into a dictionary from element #2 onwards, using the values in element #1 as keys (namely x data and y data). To ensure you get a key:[list of values] dictionary, use dict.setdefault(key, []).append(value).
Step 4: Create a normal dataframe from the dictionary, as all the values are stored as keys and values.
Step 5: Now that you have the dataframe, create the MultiIndex and convert the columns to it.
Putting all this together, the code is:
import pandas as pd
text = ',file_05,,\r\nx data,y data\r\n-970.0,-34.12164,\r\n-959.0,-32.37526,\r\n-949.0,-30.360199,\r\n-938.0,-28.74816,\r\n-929.0,-27.53912,\r\n-920.0,-25.92707,\r\n-911.0,-24.31503,\r\n-900.0,-23.64334,\r\n-891.0,-22.29997,'
line_text = [txt.split(',') for txt in text.split('\r\n')]
dct = {}
for x,y,z in line_text[2:]:
dct.setdefault(line_text[1][0], []).append(x)
dct.setdefault(line_text[1][1], []).append(y)
df = pd.DataFrame(dct)
df.columns = pd.MultiIndex.from_tuples([(line_text[0][i],line_text[1][i]) for i in [0,1]])
print(df)
Output of this will be:
              file_05
   x data      y data
0  -970.0   -34.12164
1  -959.0   -32.37526
2  -949.0  -30.360199
3  -938.0   -28.74816
4  -929.0   -27.53912
5  -920.0   -25.92707
6  -911.0   -24.31503
7  -900.0   -23.64334
8  -891.0   -22.29997
You can convert your raw data to a table with Python, then save it to a csv file with the csv package.
from pandas import DataFrame
import csv

# s is the raw data
s = ",file_05,,\r\nx data,y data\r\n-970.0,-34.12164,\r\n-959.0,-32.37526,\r\n-949.0,-30.360199,\r\n-938.0,-28.74816,\r\n-929.0,-27.53912,\r\n-920.0,-25.92707,\r\n-911.0,-24.31503,\r\n-900.0,-23.64334,\r\n-891.0,-22.29997,"

# convert raw data to a table
table = [i.split(',') for i in s.split("\r\n")]
table = [i[:2] for i in table]

# table looks like:
"""
[['', 'file_05'],
 ['x data', 'y data'],
 ['-970.0', '-34.12164'],
 ['-959.0', '-32.37526'],
 ['-949.0', '-30.360199'],
 ...
 ['-891.0', '-22.29997']]
"""

# save to output.csv file
with open('output.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerows(table)

# save to DataFrame df
df = DataFrame(table[2:], columns=table[1][:2])
print(df)
I have a .csv file that is split into sections, each starting with < string > on a row of its own, as in this example. This is followed by a set of columns and their respective rows of values. Columns are not consistent between sections.
< section1 >
col1 col2 col3
val1 val2 val3

< section2 >
col3 col4 col5
val4 val5 val6
val7 val8 val9
...etc. Is there a way in which I can, whether the file is in .txt or .csv, import each section either:
1) into separate dataframes?
2) into the same dataframe, addressed something like df[section][col]?
Many thanks!
Depending on the size of your csv, you could read in the entire file into Pandas and split the dataframe into multiple dataframes via a list comprehension.
data = '''<Network>;;;;;;;;;;;;;;;;;;;;;
Property;Value;;;;;;;;;;;;;;;;;;;;
Title;;;;;;;;;;;;;;;;;;;;;
Version;6.4;;;;;;;;;;;;;;;;;;;;
;;;;;;;;;;;;;;;;;;;;;
<Sites>;;;;;;;;;;;;;;;;;;;;;
Name;LocationCode;Longitude;Latitude;;;;;;;;;;...'''
import pandas as pd
from io import StringIO

df = pd.read_csv(StringIO(data), header=None)
Create a list of dataframe names (the headers of each df):
df_names = df[0].str.extract(r'(<[a-zA-Z]+>)')[0].str.strip('<>').dropna().tolist()
Find the indices of the headers:
regions = df.loc[df[0].str.contains(r'<[a-zA-Z]+')].index.tolist()
last_row = df.index[-1]
regions.append(last_row)
from more_itertools import windowed
Create windows for each 'sub' dataframe:
regions_window = list(windowed(regions, 2))
The following function helps with some cleanup during the dataframe extraction:
def some_cleanup(df):
    df.columns = df.iloc[0].str.extract(r'(<[a-zA-Z]+>)')[0].str.strip('<>')
    df = df.iloc[1:]
    return df
Extract the dataframes:
M = [df.loc[start:end].pipe(some_cleanup) for start, end in regions_window]
Create a dict with the dataframe names as keys:
dataframe_dict = dict(zip(df_names, M))
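Each section is then available by name, for example (using the Sites section from the sample data):
sites = dataframe_dict['Sites']
sites.head()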
I think you can take a simple approach and read the txt file like this:
with open("dummy.txt") as f:
lines = f.readlines()
Now just get the location of each section:
sections = [i for i, line in enumerate(lines) if "<" in line]
sections.append(len(lines))  # sentinel so the last section is captured
Then you can use sections to read the in-between data into pandas dataframes (the column names are on the line right after each section marker):
import pandas as pd

for i in range(len(sections) - 1):
    header = lines[sections[i] + 1].split()
    rows = [line.split() for line in lines[sections[i] + 2:sections[i + 1]]]
    df = pd.DataFrame(rows, columns=header)
    print(df.head())
There are some great answers here already but I'd recommend a Unix tool! It is shorter and will scale to very large files that don't fit into Pandas.
Assuming your file is called foo.csv:
awk '/< section/{x=i++"foo_mini";next}{print > x;}' foo.csv
Creates as many (numbered) {n}foo_mini files as you have sections. (It seeks the pattern < section, and then starts a new file from the following line.)
Then for completeness' sake, add the csv extension:
for file in *foo_mini; do mv "$file" "${file/foo_mini/foo_mini.csv}"; done
You thus have:
0foo_mini.csv
1foo_mini.csv
etc...
It's then a cinch to read them in with Pandas as separate dataframes, and concat them if you like.
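A sketch of that read-back step, assuming the renamed files from above and whitespace-separated columns:

import glob
import pandas as pd

# One dataframe per section file; sorted() keeps the numbered order stable
frames = [pd.read_csv(name, sep=r'\s+') for name in sorted(glob.glob('*foo_mini.csv'))]

# Optional: stack them into one dataframe; columns missing from a section become NaN
combined = pd.concat(frames, ignore_index=True, sort=False)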
I'd do something like this:
import re
import pandas as pd

new_section = False
header_read = False
data_for_frame = list()
frames = list()  # collect one dataframe per section

for row in data.splitlines():
    if row.startswith('< '):
        new_section = True
        continue
    if re.match(r'^\s*$', row):
        new_section = False
        header_read = False
        frames.append(pd.DataFrame(data_for_frame, columns=columns))
        data_for_frame = list()  # reset for the next section
        continue
    if new_section:
        if not header_read:
            columns = row.split(' ')
            header_read = True
            continue
        if header_read:
            data_for_frame.append(row.split(' '))
            continue
The only important thing is that the CSV file ends with an empty line as well, otherwise the last section is never flushed into a dataframe. And you have to take care of the dataframe naming.
The data.splitlines() just came from my own short test; you have to replace it with with open('myfile', 'r') as f: and so on.
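For example:

with open('myfile', 'r') as f:
    data = f.read()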
I want to read the first 3 columns of a csv file and do some modification before storing them.
Data in csv file:
{::[name]str1_str2_str3[0]},1,U0.00 - Sensor1 Not Ready\nTry Again,1,0,12
{::[name]str1_str2_str3[1]},2,U0.00 - Sensor2 Not Ready\nTry Again,1,0,12
From column 1, I just want to parse the value 0 or 1 inside the [ ].
Then the value in column 2.
From column 3, I want to parse the substring "Sensor1 Not Ready", convert it to upper case, and replace the spaces with underscores (e.g. SENSOR1_NOT_READY). Then print the string in a new column.
Parsing format -
**<value from column 1>.<value from column 2>.<string from column 3>**
I am new to coding in Python. Can someone help me with this? What is the best and the most efficient way to do this?
TIA
What I have tried so far -
import csv
from collections import defaultdict

columns = defaultdict(list)

with open('filename.csv', 'rb') as f:
    reader = csv.reader(f, delimiter=',')
    for row in reader:
        for i in range(len(row)):
            columns[i].append(row[i])

columns = dict(columns)
Is this a good way for Column 3?
x = ...  # Parsed data from Column 3
a, b = x.split("\n")  # 'a' denotes the substring before \n
c, d = a.split("-")   # 'd' denotes the substring after '-'
e = d.upper()
new_str = e.replace(" ", "_")
print new_str
My suggestion is to read a whole line as a string, and then extract the desired data with the re module like this:
import re

term = r'\[(\d)\].*,(\d+),.*-\s([\w\s]+)\n'
line = '{::[name]str1_str2_str3[0]},1,U0.00 - Sensor1 Not Ready\nTry Again,1,0,12'
capture = list(re.search(term, line).groups())
capture[-1] = '_'.join(capture[-1].split()).upper()
result = ','.join(capture)
# result == '0,1,SENSOR1_NOT_READY'
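To process every record in the file the same way, one option (assuming the file is named filename.csv as in the question, and that the \n inside each record is a real newline) is to scan the whole file text with re.finditer:

import re

term = r'\[(\d)\].*,(\d+),.*-\s([\w\s]+)\n'

with open('filename.csv') as f:
    text = f.read()

results = []
for m in re.finditer(term, text):
    capture = list(m.groups())
    # Upper-case the message and replace spaces with underscores
    capture[-1] = '_'.join(capture[-1].split()).upper()
    results.append(','.join(capture))

# For the two sample records: ['0,1,SENSOR1_NOT_READY', '1,2,SENSOR2_NOT_READY']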
I’m new to coding, and trying to extract a subset of data from a large file.
File_1 contains the data in two columns: ID and Values.
File_2 contains a large list of IDs, some of which may be present in File_1 while others will not be present.
If an ID from File_2 is present in File_1, I would like to extract those values and write the ID and value to a new file, but I’m not sure how to do this. Here is an example of the files:
File_1: data.csv
ID Values
HOT224_1_0025m_c100047_1 16
HOT224_1_0025m_c10004_1 3
HOT224_1_0025m_c100061_1 1
HOT224_1_0025m_c10010_2 1
HOT224_1_0025m_c10020_1 1
File_2: ID.xlsx
IDs
HOT224_1_0025m_c100047_1
HOT224_1_0025m_c100061_1
HOT225_1_0025m_c100547_1
HOT225_1_0025m_c100561_1
I tried the following:
import pandas as pd
data_file = pd.read_csv('data.csv', index_col = 0)
ID_file = pd.read_excel('ID.xlsx')
values_from_ID = data_file.loc[['ID_file']]
The following error occurs:
KeyError: "None of [['ID_file']] are in the [index]"
Not sure if I am reading in the excel file correctly.
I also do not know how to write the extracted data to a new file once I get the code to do it.
Thanks for your help.
With pandas:
import pandas as pd
data_file = pd.read_csv('data.csv', index_col=0, delim_whitespace=True)
ID_file = pd.read_excel('ID.xlsx', index_col=0)
res = data_file.loc[ID_file.index].dropna()
res.to_csv('result.csv')
Content of result.csv:
IDs,Values
HOT224_1_0025m_c100047_1,16.0
HOT224_1_0025m_c100061_1,1.0
In steps:
You need to read your csv as whitespace-delimited:
data_file = pd.read_csv('data.csv', index_col=0, delim_whitespace=True)
it looks like this:
>>> data_file
Values
ID
HOT224_1_0025m_c100047_1 16
HOT224_1_0025m_c10004_1 3
HOT224_1_0025m_c100061_1 1
HOT224_1_0025m_c10010_2 1
HOT224_1_0025m_c10020_1 1
Now, read your Excel file, using the ids as index:
ID_file = pd.read_excel('ID.xlsx', index_col=0)
and you use its index with loc to get the matching entries from your first dataframe. Drop the missing values with dropna():
res = data_file.loc[ID_file.index].dropna()
Finally, write to the result csv:
res.to_csv('result.csv')
You can do it using a simple dictionary in Python. You can make a dictionary from file 1 and read the IDs from file 2. The IDs from file 2 can be checked against the dictionary, and only the matching ones written to your output file. Something like this could work:
with open('data.csv', 'r') as f:
    lines = f.readlines()

# Skip the CSV header
lines = lines[1:]
table = {l.split()[0]: l.split()[1] for l in lines if len(l.strip()) != 0}

with open('id.csv', 'r') as f:
    lines = f.readlines()

# Skip the CSV header
lines = lines[1:]
matchedIDs = [(l.strip(), table[l.strip()]) for l in lines if l.strip() in table]
Now you will have your matched IDs and their values in a list of tuples called matchedIDs. You can write them in any format you like in a file.
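For example, with the csv module (the output file name is illustrative):

import csv

with open('matched.csv', 'w', newline='') as out:
    writer = csv.writer(out)
    writer.writerow(['ID', 'Value'])  # header row
    writer.writerows(matchedIDs)      # one (id, value) tuple per row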
I'm also new to Python programming, so the code I used below might not be the most efficient. The situation I assumed is that we want to find the IDs that appear in both data.csv and id.csv; there might be some IDs in data.csv that are not in id.csv and vice versa.
import pandas as pd

data = pd.read_csv('data.csv')
id2 = pd.read_csv('id.csv')

data.ID = data['ID']
id2.ID = id2['IDs']

d = []
for row in data.ID:
    d.append(row)

f = []
for row in id2.ID:
    f.append(row)

g = []
for i in d:
    if i in f:
        g.append(i)

data = pd.read_csv('data.csv', index_col='ID')
new_data = data.loc[g, :]
new_data.to_csv('new_data.csv')
This is the code I ended up using. It worked perfectly. Thanks to everyone for their responses.
import pandas as pd
data_file = pd.read_csv('data.csv', index_col=0)
ID_file = pd.read_excel('ID.xlsx', index_col=0)
res = data_file.loc[ID_file.index].dropna()
res.to_csv('result.csv')