How do I create a dataframe from a string that looks like this (part of the string)?
,file_05,,\r\nx data,y data\r\n-970.0,-34.12164,\r\n-959.0,-32.37526,\r\n-949.0,-30.360199,\r\n-938.0,-28.74816,\r\n-929.0,-27.53912,\r\n-920.0,-25.92707,\r\n-911.0,-24.31503,\r\n-900.0,-23.64334,\r\n-891.0,-22.29997,
Trying to make a dataframe that looks like this:
In the code below s is the string:
import pandas as pd
from io import StringIO
# parse the string; dropna(axis=1) drops the all-NaN columns created by the trailing commas
df = pd.read_csv(StringIO(s)).dropna(axis=1)
# blank out the auto-generated name of the first column
df.rename(columns={df.columns[0]: ""}, inplace=True)
By the way, if the string comes from a csv file then it is simpler to read the file directly using pd.read_csv.
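For example, a minimal sketch assuming the same content is stored in a file (the filename data.csv is just a placeholder):
import pandas as pd

# read the csv file directly instead of going through a string
df = pd.read_csv("data.csv").dropna(axis=1)
df.rename(columns={df.columns[0]: ""}, inplace=True)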
Edit: this code will create a MultiIndex of columns:
import pandas as pd
from io import StringIO
# read without a header, drop the columns that are entirely NaN, and transpose
df = pd.read_csv(StringIO(s), header=None).dropna(how="all", axis=1).T
# put "file_05" on the first level for both columns, then use the first
# two (former) rows as a two-level column index
df[0] = df.loc[1, 0]
df = df.set_index([0, 1]).T
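With the MultiIndex in place, a single column can be selected with a (level 0, level 1) tuple; in this version both columns end up under the 'file_05' level, so for example:
print(df[("file_05", "y data")])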
Looks like you want a multi-level dataframe from the string. Here's how I would do it.
Step 1: Split the string by '\r\n'. Then split each of the resulting values by ','.
Step 2: The above step creates a list of lists. Element #0 has 4 items and element #1 has 2 items. The rest have 3 items each and are the actual data.
Step 3: Convert the data into a dictionary from element #2 onwards. Use the values in element #1 as the keys for the dictionary (namely x data and y data). To ensure you end up with key:[list of values], use dict.setdefault(key, []).append(value). This builds the data as a key:[list of values] dictionary.
Step 4: Create a normal dataframe from the dictionary, since all the values are now stored as keys and values in the dictionary.
Step 5: Now that you have the dataframe, convert its columns to a MultiIndex.
Putting all this together, the code is:
import pandas as pd
text = ',file_05,,\r\nx data,y data\r\n-970.0,-34.12164,\r\n-959.0,-32.37526,\r\n-949.0,-30.360199,\r\n-938.0,-28.74816,\r\n-929.0,-27.53912,\r\n-920.0,-25.92707,\r\n-911.0,-24.31503,\r\n-900.0,-23.64334,\r\n-891.0,-22.29997,'
line_text = [txt.split(',') for txt in text.split('\r\n')]
dct = {}
for x, y, z in line_text[2:]:
    dct.setdefault(line_text[1][0], []).append(x)
    dct.setdefault(line_text[1][1], []).append(y)
df = pd.DataFrame(dct)
df.columns = pd.MultiIndex.from_tuples([(line_text[0][i], line_text[1][i]) for i in [0, 1]])
print(df)
Output of this will be:
            file_05
   x data      y data
0  -970.0   -34.12164
1  -959.0   -32.37526
2  -949.0  -30.360199
3  -938.0   -28.74816
4  -929.0   -27.53912
5  -920.0   -25.92707
6  -911.0   -24.31503
7  -900.0   -23.64334
8  -891.0   -22.29997
You can convert your raw data to a table (a list of lists) with plain Python, then either save it to a csv file with the csv package or build the DataFrame directly.
from pandas import DataFrame
# s is the raw data
s = ",file_05,,\r\nx data,y data\r\n-970.0,-34.12164,\r\n-959.0,-32.37526,\r\n-949.0,-30.360199,\r\n-938.0,-28.74816,\r\n-929.0,-27.53912,\r\n-920.0,-25.92707,\r\n-911.0,-24.31503,\r\n-900.0,-23.64334,\r\n-891.0,-22.29997,"
# convert raw data to a table
table = [i.split(',') for i in s.split("\r\n")]
table = [i[:2] for i in table]
# table is like
"""
[['', 'file_05'],
['x data', 'y data'],
['-970.0', '-34.12164'],
['-959.0', '-32.37526'],
['-949.0', '-30.360199'],
...
['-891.0', '-22.29997']]
"""
# save to output.csv file
import csv
with open('output.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerows(table)
# build the DataFrame df
df = DataFrame(table[2:], columns=table[1][:2])
print(df)
I have a csv with the following entries:
apple,orange,bannana,grape
10,5,6,4
four,seven,eight,nine
yes,yes,no,yes
3,5,7,4
two,one,six,nine
no,no,no,yes
2,4,7,8
yellow,four,eight,one
no,yes,no,no
I would like to make a new csv file with the following format and so on:
apple,10,four,yes
orange,5,seven,yes
bannana,6,eight,no
grape,4,nine,yes
apple,3,two,no
orange,5,one,no
bannana,7,six,no
grape,4,nine,yes
So after grape it starts at apple with the new values.
I have tried using pandas DataFrames but can't figure out how to get the data formatted the way I need it.
You could try the following in pure Python (data.csv is the name of the input file):
import csv
from itertools import islice
with open("data.csv", "r") as fin,\
open("data_new.csv", "w") as fout:
reader, writer = csv.reader(fin), csv.writer(fout)
header = next(reader)
length = len(header) - 1
while (rows := list(islice(reader, length))):
writer.writerows([first, *rest] for first, rest in zip(header, zip(*rows)))
Or with Pandas:
import pandas as pd
df = pd.read_csv("data.csv")
df = pd.concat(gdf.T for _, gdf in df.set_index(df.index % 3).groupby(df.index // 3))
df.reset_index().to_csv("data_new.csv", index=False, header=False)
Output file data_new.csv for the provided sample:
apple,10,four,yes
orange,5,seven,yes
bannana,6,eight,no
grape,4,nine,yes
apple,3,two,no
orange,5,one,no
bannana,7,six,no
grape,4,nine,yes
apple,2,yellow,no
orange,4,four,yes
bannana,7,eight,no
grape,8,one,no
Hope it works for you.
df = pd.read_csv('<source file name>')
df.T.to_csv('<destination file name>')
You can transpose your dataframe in pandas as below.
pd.read_csv('file.csv', index_col=0, header=None).T
This question is already answered here:
Can pandas read a transposed CSV?
According to your new description, the problem is completely different:
you need to split your dataframe into subsets, transpose them, and merge them.
import pandas as pd

# Read the dataframe without a header
df = pd.read_csv('your_dataframe.csv', header=None)

# Create an empty DataFrame to store the transposed data
tr = pd.DataFrame()

# Build each subset (header row + 3 data rows), transpose it,
# and concatenate it onto the new DataFrame
for i in range(1, df.shape[0], 3):
    temp = pd.concat([df.iloc[[0]], df.iloc[i:i + 3]]).transpose()
    temp.columns = [0, 1, 2, 3]
    tr = pd.concat([tr, temp])
I have a list of string values that I read from a text document with splitlines, which yields something like this:
X = ["NAME|Contact|Education","SMITH|12345|Graduate","NITA|11111|Diploma"]
I have tried this
for i in X:
    textnew = i.split("|")
    data[x] = textnew
I want to make a dataframe out of this
Name Contact Education
SMITH 12345 Graduate
NITA 11111 Diploma
You can read it directly from your file by specifying a sep argument to pd.read_csv.
df = pd.read_csv("/path/to/file", sep='|')
Or if you wish to convert it from the list of strings instead:
data = [row.split('|') for row in X]
headers = data.pop(0) # Pop the first element since it's header
df = pd.DataFrame(data, columns=headers)
You had it almost correct actually, but don't use data as a dictionary (i.e. assigning by key with data[x] = textnew):
X = ["NAME|Contact|Education","SMITH|12345|Graduate","NITA|11111|Diploma"]
df = []
for i in X:
    df.append(i.split("|"))
print(df)
# [['NAME', 'Contact', 'Education'], ['SMITH', '12345', 'Graduate'], ['NITA', '11111', 'Diploma']]
It depends on further transformations, but pandas might be overkill for this kind of task.
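For instance, a minimal sketch that stays in plain Python, pairing each data row with the header row via zip (it reuses the df list of lists built above):
# split the header off and turn each remaining row into a dict
header, *rows = df
records = [dict(zip(header, row)) for row in rows]
print(records)
# [{'NAME': 'SMITH', 'Contact': '12345', 'Education': 'Graduate'},
#  {'NAME': 'NITA', 'Contact': '11111', 'Education': 'Diploma'}]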
Here is a solution for your problem
import pandas as pd
X = ["NAME|Contact|Education","SMITH|12345|Graduate","NITA|11111|Diploma"]
data = []
for i in X:
    data.append(i.split("|"))
columns = data.pop(0)  # the first element is the header row
df = pd.DataFrame(data, columns=columns)
In your situation, you can avoid loading the file with readlines and let pandas take care of reading the file.
As mentioned above, the solution is a standard read_csv:
import os
import pandas as pd
path = "/tmp"
filepath = "file.xls"
filename = os.path.join(path,filepath)
df = pd.read_csv(filename, sep='|')
print(df.head())
Another approach (useful when you have no access to the file, or when you have to deal with a list of strings) is to wrap the list of strings as an in-memory text file and then load it normally using pandas:
import pandas as pd
from io import StringIO
X = ["NAME|Contact|Education", "SMITH|12345|Graduate", "NITA|11111|Diploma"]
# Wrap the list of strings as an in-memory file, one element per line
DATA = StringIO("\n".join(X))
# Load as a pandas dataframe
df = pd.read_csv(DATA, delimiter="|")
This yields the desired DataFrame with the NAME, Contact and Education columns.
I have a data frame from pandas and now I want to add column names, but only for the second row. Here is an example of my previous output:
Desired output:
My code:
data_line = open("file1.txt", mode="r")
lines = []
for line in data_line:
    lines.append(line)
for i, line in enumerate(lines):
    # print('{}={}'.format(i+1, line.strip()))
    file1_header = lines[0]
num_line = 1
Dictionary_File1 = {}
Value_File1 = data_type[0:6]
Value_File1_short = []
i = 1
for element in Value_File1:
    type = element.split(',')
    Value_File1_short.append(type[0] + ", " + type[1] + ", " + type[4])
    i += 1
Dictionary_File1[file1_header] = Value_File1_short
pd_file1 = pd.DataFrame.from_dict(Dictionary_File1)
You should have a look at pandas.read_csv. The header keyword parameter allows you to indicate which line in the file to use for the header names.
You could probably do it with something like:
pd.read_csv("file1.txt", header=1)
From my python shell I tested it out with:
>>> from io import StringIO # I use python3
>>> import pandas as pd
>>> data = """Type Type2 Type3
... A B C
... 1 2 3
... red blue green"""
>>> # StringIO below allows us to use "data" as input to read_csv
>>> # "sep" keyword is used to indicate how columns are separated in data
>>> df = pd.read_csv(StringIO(data), header=1, sep='\s+')
>>> df
     A     B      C
0    1     2      3
1  red  blue  green
You can write a row using the csv module before writing your dataframe to the same file. Notice this won't help when reading back into Pandas, which doesn't work with "duplicate headers". You could instead create MultiIndex columns (sketched after the example below), but this isn't necessary for your desired output.
import pandas as pd
import csv
from io import StringIO
# input file
x = """A,B,C
1,2,3
red,blue,green"""
# replace StringIO(x) with 'file.txt'
df = pd.read_csv(StringIO(x))
with open('file.txt', 'w', newline='') as fout:
    writer = csv.writer(fout)
    writer.writerow(['Type', 'Type2', 'Type3'])
    df.to_csv(fout, index=False)
# read file to check output is correct
df = pd.read_csv('file.txt')
print(df)
#    Type  Type2  Type3
# 0     A      B      C
# 1     1      2      3
# 2   red   blue  green
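For completeness, a minimal sketch of the MultiIndex-columns variant mentioned above (reusing x, pandas and StringIO from the code above; the tuples simply pair the extra labels with the existing column names):
df = pd.read_csv(StringIO(x))
df.columns = pd.MultiIndex.from_tuples(
    [('Type', 'A'), ('Type2', 'B'), ('Type3', 'C')])
# df now has two header levels; df['Type'] returns the column(s) under 'Type'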
So if I understand properly, you have a file "file.txt" containing your data, and a list containing the types of your data.
You want to add the list of types to the pandas.DataFrame of your data. Correct?
If so, you can read the data from the txt file into a pandas DataFrame using pandas.read_csv(), and then define the column headers using df.columns.
So it would look something like:
df = pd.read_csv("file1.txt", header=None)
df.columns = data_type[0:6]
I hope this helps!
Cheers
I have the following csv file:
There are about 6-8 rows at the top of the file before the header row. I know how to make a new dataframe in Pandas and filter the data:
df = pd.read_csv('payments.csv')
df = df[df["type"] == "Order"]
print(df.groupby('sku').size())
df = df[df["marketplace"] == "amazon.com"]
print(df.groupby('sku').size())
df = df[df["promotional rebates"] > ((df["product sales"] + df["shipping credits"]) * -.25)]
print(df.groupby('sku').size())
df.to_csv("out.csv")
My issue is with the headers. I need to:
1. Look for the row that has date/time and another known field, so that I do not have to change my code if the file keeps changing the number of rows before the header.
2. Make a new DataFrame excluding those rows.
What is the best approach to make sure the code does not break as long as the header row exists and a few fields match? Open to any suggestions.
Considering a CSV file like this:
random line content
another random line
yet another one
datetime, settelment id, type
dd, dd, dd
You can use the following to compute the header's line number:
#load the first 20 rows of the csv file as a one column dataframe
#to look for the header
df = pd.read_csv("csv_file.csv", sep="|", header=None, nrows=20)
# use a regular expression to check which row contains the header;
# the following generates an array of booleans,
# with True if the row matches the regex "datetime.+settelment id.+type"
indices = df.iloc[:,0].str.contains("datetime.+settelment id.+type")
# get the row index of the header
header_index = df[indices].index.values[0]
and read the csv file starting from the header's index:
# to read the csv file, use the following:
df = pd.read_csv("csv_file.csv", skiprows=header_index+1)
Reproducible example:
import pandas as pd
from io import StringIO
st = """
random line content
another random line
yet another one
datetime, settelment id, type
dd, dd, dd
"""
df = pd.read_csv(StringIO(st), sep="|", header=None, nrows=20)
indices = df.iloc[:,0].str.contains("datetime.+settelment id.+type")
header_index = df[indices].index.values[0]
df = pd.read_csv(StringIO(st), skiprows=header_index+1)
print(df)
print("columns")
print(df.columns)
print("shape")
print(df.shape)
Output:
datetime settelment id type
0 dd dd dd
columns
Index(['datetime', ' settelment id', ' type'], dtype='object')
shape
(1, 3)
If I have a file of 100+ columns, how can I make each column into an array, referenced by the column header, without having to do header1 = [1,2,3], header2 = ['a','b','c'] , and so on..?
Here is what I have so far, where headers is a list of the header names:
import pandas as pd
data = []
df = pd.read_csv('outtest.csv')
for i in headers:
    data.append(getattr(df, i).values)
I want each element of the array headers to be the variable name of the corresponding data array in data (they are in order). Somehow I want one line that does this so that the next line I can say, for example, test = headername1*headername2.
import pandas as pd
If the headers are in the csv file, we can simply use:
df = pd.read_csv('outtest.csv')
If the headers are not present in the csv file:
headers = ['list', 'of', 'headers']
df = pd.read_csv('outtest.csv', header=None, names=headers)
Assuming headername1 and headername2 are constants:
test = df.headername1 * df.headername2
Or
test = df['headername1'] * df['headername2']
Assuming they are variable:
test = df[headername1] * df[headername2]
By default this form of access returns a pd.Series, which is generally interoperable with numpy. You can fetch the values explicitly using .values:
df[headername1].values
But you seem to already know this.
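If the goal really is one array per header without creating a separate variable for each, a dict comprehension keyed by the column names is a minimal sketch (headername1 and headername2 are the same placeholder names used above):
# map every column header to its values as a NumPy array
arrays = {name: df[name].values for name in df.columns}
test = arrays[headername1] * arrays[headername2]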
I think I see what you're going for, so using a StringIO object to simulate a file object as the setup:
import pandas as pd
from io import StringIO
txt = '''foo,bar,baz
1, 2, 3
3, 2, 1'''
fileobj = StringIO(txt)
Here's the approximate code you want:
data = []
df = pd.read_csv(fileobj)
for i in df.columns:
    data.append(df[i])
for i in data:
    print(i)
prints
0    1
1    3
Name: foo, dtype: int64
0    2
1    2
Name: bar, dtype: int64
0    3
1    1
Name: baz, dtype: int64