I have a messy CSV file (only the extension is .csv). When I open it in MS Excel with ; as the delimiter, it looks like the dummy sample below.
I investigated the file and found the following:
Some columns have names and others do not.
Row lengths vary, but each row ends with a newline character that marks where the next line starts.
Question:
How can I read this table into pandas so that all existing column headers are kept and the headerless columns are labeled with consecutive numbers, while handling the variable row lengths?
In fact, I want to take 8 cell values at a time from the headerless columns, repeatedly, until each row is exhausted, for analysis.
N.B. I have tried usecols, names, skiprows, sep, etc. in read_csv, but with no success.
EDIT
Added a sample input and the expected output (the formatting is rough, but pandas.read_clipboard() should work)
INPUT
car_id car_type entry_gate entry_time(ms) exit_gate exit_time(ms) traveled_dist(m) avg_speed(m/s) trajectory(x[m] y[m] speed[m/s] a_tangential[ms-2] a_lateral[ms-2] timestamp[ms] )
24 Bus 25 4300 26 48520 118.47 2.678999 509552.78 5039855.59 10.074 0.429 0.2012 0 509552.97 5039855.57 10.0821 0.3853 0.2183 20
25 Car 25 20 26 45900 113.91 2.482746 509583.7 5039848.78 4.5344 -0.1649 0.2398 0 509583.77 5039848.71
26 Car - - - - 109.68 8.859805 509572.75 5039862.75 4.0734 -0.7164 -0.1066 0 509572.67 5039862.76 4.0593 -0.7021 -0.1141 20 509553.17 5039855.55 10.0886 0.2636 0.2356 40
27 Car - - - - 119.84 3.075936 509582.73 5039862.78 1.191 0.5247 0.0005 0 509582.71 5039862.78 1.2015 0.5322
28 Car - - - - 129.64 4.347466 509591.07 5039862.9 1.6473 0.1987 -0.0033 0 509591.04 5039862.89 1.6513 0.2015 -0.0036 20
Expected OUTPUT (dataframe)
car_id car_type entry_gate entry_time(ms) exit_gate exit_time(ms) traveled_dist(m) avg_speed(m/s) trajectory(x[m] y[m] speed[m/s] a_tangential[ms-2] a_lateral[ms-2] timestamp[ms] 1 2 3 4 5 6 7 8 9 10 11 12
24 Bus 25 4300 26 48520 118.47 2.678999 509552.78 5039855.59 10.074 0.429 0.2012 0 509552.97 5039855.57 10.0821 0.3853 0.2183 20
25 Car 25 20 26 45900 113.91 2.482746 509583.7 5039848.78 4.5344 -0.1649 0.2398 0 509583.77 5039848.71
26 Car - - - - 109.68 8.859805 509572.75 5039862.75 4.0734 -0.7164 -0.1066 0 509572.67 5039862.76 4.0593 -0.7021 -0.1141 20 509553.17 5039855.55 10.0886 0.2636 0.2356 40
27 Car - - - - 119.84 3.075936 509582.73 5039862.78 1.191 0.5247 0.0005 0 509582.71 5039862.78 1.2015 0.5322
28 Car - - - - 129.64 4.347466 509591.07 5039862.9 1.6473 0.1987 -0.0033 0 509591.04 5039862.89 1.6513 0.2015 -0.0036 20
Preprocessing
The function get_names() opens the file and finds the maximum length of the split rows.
Then it reads the first row and works out how many names are missing relative to that maximum.
The last value of the first row is ), so I remove it with firstline[:-1] and then number the missing columns with rng = range(1, m - lenfirstline + 2).
The + 2 is needed because the range starts at 1 with an exclusive upper bound, and one extra label replaces the dropped ).
Then you can call read_csv, skip the first line, and pass the output of get_names() as names.
import pandas as pd
import csv

# preprocessing
def get_names():
    with open('test/file.txt', 'r') as csvfile:
        reader = csv.reader(csvfile)
        num = []
        for i, row in enumerate(reader):
            if i == 0:
                firstline = ''.join(row).split()
                lenfirstline = len(firstline)
                # print(firstline, lenfirstline)
            num.append(len(''.join(row).split()))
        m = max(num)
        rng = range(1, m - lenfirstline + 2)
        # drop the trailing ) and append the numeric column labels
        rng = firstline[:-1] + list(rng)
        return rng

# names is the list returned from the function
df = pd.read_csv('test/file.txt', sep=r"\s+", names=get_names(), index_col=[0], skiprows=1)

# temporarily display 10 rows and 30 columns
with pd.option_context('display.max_rows', 10, 'display.max_columns', 30):
    print(df)
car_type entry_gate entry_time(ms) exit_gate exit_time(ms) \
car_id
24 Bus 25 4300 26 48520
25 Car 25 20 26 45900
26 Car - - - -
27 Car - - - -
28 Car - - - -
traveled_dist(m) avg_speed(m/s) trajectory(x[m] y[m] \
car_id
24 118.47 2.678999 509552.78 5039855.59
25 113.91 2.482746 509583.70 5039848.78
26 109.68 8.859805 509572.75 5039862.75
27 119.84 3.075936 509582.73 5039862.78
28 129.64 4.347466 509591.07 5039862.90
speed[m/s] a_tangential[ms-2] a_lateral[ms-2] timestamp[ms] \
car_id
24 10.0740 0.4290 0.2012 0
25 4.5344 -0.1649 0.2398 0
26 4.0734 -0.7164 -0.1066 0
27 1.1910 0.5247 0.0005 0
28 1.6473 0.1987 -0.0033 0
1 2 3 4 5 6 7 \
car_id
24 509552.97 5039855.57 10.0821 0.3853 0.2183 20 NaN
25 509583.77 5039848.71 NaN NaN NaN NaN NaN
26 509572.67 5039862.76 4.0593 -0.7021 -0.1141 20 509553.17
27 509582.71 5039862.78 1.2015 0.5322 NaN NaN NaN
28 509591.04 5039862.89 1.6513 0.2015 -0.0036 20 NaN
8 9 10 11 12
car_id
24 NaN NaN NaN NaN NaN
25 NaN NaN NaN NaN NaN
26 5039855.55 10.0886 0.2636 0.2356 40
27 NaN NaN NaN NaN NaN
28 NaN NaN NaN NaN NaN
Postprocessing
First you have to estimate the maximum number of columns N. I know the real number is 26, so I estimate N = 30. Calling read_csv with names=range(N) creates all-NaN columns for the difference between the estimated and the real number of columns.
After dropping those, you can take the column names from the first row, keeping only the non-NaN values (the trailing ) is removed with [:-1]): df1.loc[0].dropna()[:-1].
Then a new Series numbering the remaining, unnamed columns from 1 upward is appended to form the full header.
Finally, the first row (the partial header) is dropped by slicing the DataFrame.
# set more than the estimated number of columns
N = 30
df1 = pd.read_csv('test/file.txt', sep=r"\s+", names=range(N))
df1 = df1.dropna(axis=1, how='all')  # drop columns that are entirely NaN

# build the new header: existing names plus consecutive numbers for the rest
names = df1.loc[0].dropna()[:-1]
extra = pd.Series(range(1, len(df1.columns) - len(names) + 1))
df1.columns = pd.concat([names, extra])

# remove the first line, which holds the incomplete column names
df1 = df1.iloc[1:]
print(df1.head())
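As a follow-up to the original goal of pulling a fixed number of cells at a time from the headerless columns, here is a minimal sketch (the chunk size and the chunks list are assumptions for illustration, not part of the answer above):
# sketch: split the numbered (headerless) columns into consecutive chunks per row
k = 8  # cells taken per chunk, as described in the question; adjust as needed
numbered_cols = [c for c in df1.columns if not isinstance(c, str)]  # the numeric labels
chunks = [df1[numbered_cols[i:i + k]] for i in range(0, len(numbered_cols), k)]
for chunk in chunks:
    print(chunk)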
Related
I have a data frame like this:
Day          Type   From   to
01/09/2021   car    170    Nan
02/09/2021   car    140    Nan
03/09/2021   none   120    77
04/09/2021   car    15     45
05/09/2021   car    34     Nan
06/09/2021   car    36     84
07/09/2021   none   23     11
08/09/2021   car    36     Nan
The logic is:
For each row whose Type is none:
fill the previous Nan rows in column to with the value from column From (only for the beginning of the dataset, up to the first row with Type none);
fill the following Nan rows in column to with the value from column to.
The values used to fill the missing entries must be taken from the latest row whose Type is none.
Desired output:
Day          Type   From   to
01/09/2021   car    170    120
02/09/2021   car    140    120
03/09/2021   none   120    77
04/09/2021   car    15     45
05/09/2021   car    34     77
06/09/2021   car    36     84
07/09/2021   none   23     11
08/09/2021   car    36     11
I tried using ffill and bfill, but I'm not sure how to apply the conditions.
Here ind holds the indexes of the rows where 'Type' == 'none'. The slice of the dataframe up to the first element of ind is copied into aaa. ind1 then gives the indexes of the leading rows where 'to' == 'Nan', and their values are set via loc.
The elements of ind_to are fed into a list comprehension, and the desired values are set through the my_finc function.
import pandas as pd

df = pd.read_csv('df.csv', header=0)

# indexes of the rows where Type is 'none'
ind = df[df['Type'] == 'none'].index

# rows before the first 'none' row: fill their missing 'to' with that row's 'From'
aaa = df[:ind[0]]
ind1 = aaa[aaa['to'] == 'Nan'].index
df.loc[ind1, 'to'] = df.loc[ind[0], 'From']

# remaining missing 'to' values: take the 'to' of the latest preceding 'none' row
ind_to = df[df['to'] == 'Nan'].index

def my_finc(x):
    bbb = df.loc[:x, 'Type']
    kkk = bbb[bbb == 'none'].index
    df.loc[x, 'to'] = df.loc[kkk[-1], 'to']

[my_finc(i) for i in ind_to]
print(df)
Output
Day Type From to
0 01/09/2021 car 170 120
1 02/09/2021 car 140 120
2 03/09/2021 none 120 77
3 04/09/2021 car 15 45
4 05/09/2021 car 34 77
5 06/09/2021 car 36 84
6 07/09/2021 none 23 11
7 08/09/2021 car 36 11
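A more vectorized alternative is possible with where/ffill; this is only a sketch and assumes the missing 'to' entries are real NaN values rather than the string 'Nan':
import numpy as np

# assumes 'to' holds np.nan for missing values; if they are the string 'Nan', convert first
df['to'] = df['to'].replace('Nan', np.nan)

is_none = df['Type'].eq('none')

# carry forward the 'to' value of the latest preceding 'none' row
forward = df['to'].where(is_none).ffill()

# rows before the first 'none' row get that row's 'From' value instead
lead_fill = df.loc[is_none, 'From'].iloc[0]

df['to'] = df['to'].fillna(forward).fillna(lead_fill)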
I have code that outputs how many times a product was bought in a specific month across all stores; however, I was wondering how I could sum with an extra condition, so that Python adds up the products bought in a specific month and a specific store.
This is my code so far:
df = df.groupby(['Month_Bought'])['Amount_Bought'].sum()
print(df)
Output:
01-2020 27
02-2020 26
03-2020 24
04-2020 23
05-2020 31
06-2020 33
07-2020 26
08-2020 30
09-2020 33
10-2020 26
11-2020 30
12-2020 30
I need to separate the data so that the dataframe looks like this:
Store1 Store2
01-2020 3 24
02-2020 4 22
03-2020 8 16
04-2020 4 19
05-2020 10 21
06-2020 11 21
07-2020 12 14
08-2020 10 20
09-2020 3 30
10-2020 14 12
11-2020 21 9
12-2020 9 21
Assuming your data is in long format (a column indicates which store each product was purchased in), you could group by store and month:
import pandas as pd
records = [
{'Month_Bought':'01-2020', 'Amount_Bought':1, 'Store': 'Store1'},
{'Month_Bought':'01-2020', 'Amount_Bought':2, 'Store': 'Store2'},
{'Month_Bought':'02-2020', 'Amount_Bought':2, 'Store': 'Store1'},
{'Month_Bought':'02-2020', 'Amount_Bought':4, 'Store': 'Store2'}
]
df = pd.DataFrame.from_records(records)
# Initial dataframe
Month_Bought Amount_Bought Store
0 01-2020 1 Store1
1 01-2020 2 Store2
2 02-2020 2 Store1
3 02-2020 4 Store2
# Now groupby store and month
df_agg = df.groupby(['Store', 'Month_Bought'], as_index=False)['Amount_Bought'].sum()
# Convert from long to wide:
df_agg_pivot = df_agg.pivot(index='Month_Bought', columns='Store', values='Amount_Bought')
# Reformat
df_agg_pivot = df_agg_pivot.reset_index()
df_agg_pivot.columns.name = None
# Final result:
Month_Bought Store1 Store2
0 01-2020 1 2
1 02-2020 2 4
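The groupby and pivot steps can also be collapsed into a single call with pivot_table; a sketch on the same records data:
# one-step alternative: pivot_table aggregates and reshapes at once
df_agg_pivot = df.pivot_table(index='Month_Bought', columns='Store',
                              values='Amount_Bought', aggfunc='sum').reset_index()
df_agg_pivot.columns.name = None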
I have a dataframe like this:
id|amount|date
20|-7|2017:12:25
20|-170|2017:12:26
20|7|2017:12:27
I want to subtract each row from another in the 'amount' column.
The output should look like this:
id|amount|date|amount_diff
20|-7|2017:12:25|0
20|-170|2017:12:26|-177
20|7|2017:12:27|-163
I used this code:
df.sort_values(by='date',inplace=True)
df['amount_diff'] = df['invoice_amount'].diff()
and obtained this output:
id|amount|date|amount_diff
20|-7|2017:12:25|163
20|-170|2017:12:26|-218
20|48|2017:12:27|0
IIUC you need:
df.sort_values(by='date',inplace=True)
df['amount_diff'] = df['amount'].add(df['amount'].shift()).fillna(0)
print (df)
id amount date amount_diff
0 20 -7 2017:12:25 0.0
1 20 -170 2017:12:26 -177.0
2 20 7 2017:12:27 -163.0
That is because, if you really wanted to subtract, your solution would already work:
df.sort_values(by='date',inplace=True)
df['amount_diff1'] = df['amount'].sub(df['amount'].shift()).fillna(0)
df['amount_diff2'] = df['amount'].diff().fillna(0)
print (df)
id amount date amount_diff1 amount_diff2
0 20 -7 2017:12:25 0.0 0.0
1 20 -170 2017:12:26 -163.0 -163.0
2 20 7 2017:12:27 177.0 177.0
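The pairwise sum can also be written with a rolling window, which some may find more readable; a sketch equivalent to the add/shift line above:
# sum each value with its predecessor, filling the first row with 0
df['amount_diff'] = df['amount'].rolling(2).sum().fillna(0)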
Long story short, I have a csv file which I read as a pandas dataframe. The file contains a weather report, but all of the measurements for temperature are in Fahrenheit. I've figured out how to convert them:
import pandas as pd
df = pd.read_csv('report.csv')
df['average temperature'] = (df['average temperature'] - 32) * 5/9
But then the data in this column has up to six decimal places.
I've found code that rounds all the data in the dataframe, but I only need this one column.
df.round(2)
I don't like that it has to be a separate piece of code on a separate line and that it modifies all of my data. Is there a way to approach this more elegantly? Is there a way to apply this to other columns in my dataframe, such as maximum temperature and minimum temperature, without having to copy the above piece of code?
To round only some columns, use a subset:
cols = ['maximum temperature','minimum temperature','average temperature']
df[cols] = df[cols].round(2)
If you want to convert only some columns from a list:
cols = ['maximum temperature','minimum temperature','average temperature']
df[cols] = ((df[cols] - 32) * 5/9).round(2)
If you want to round each column separately:
df['average temperature'] = df['average temperature'].round(2)
df['maximum temperature'] = df['maximum temperature'].round(2)
df['minimum temperature'] = df['minimum temperature'].round(2)
Sample:
import numpy as np
import pandas as pd

df = (pd.DataFrame(np.random.randint(30, 100, (10, 3)),
                   columns=['maximum temperature','minimum temperature','average temperature'])
        .assign(a='m', b=range(10)))
print (df)
maximum temperature minimum temperature average temperature a b
0 97 60 98 m 0
1 64 86 64 m 1
2 32 64 95 m 2
3 60 56 93 m 3
4 43 89 64 m 4
5 40 62 86 m 5
6 37 40 70 m 6
7 61 33 46 m 7
8 36 44 46 m 8
9 63 30 33 m 9
cols = ['maximum temperature','minimum temperature','average temperature']
df[cols] = ((df[cols] - 32) * 5/9).round(2)
print (df)
maximum temperature minimum temperature average temperature a b
0 36.11 15.56 36.67 m 0
1 17.78 30.00 17.78 m 1
2 0.00 17.78 35.00 m 2
3 15.56 13.33 33.89 m 3
4 6.11 31.67 17.78 m 4
5 4.44 16.67 30.00 m 5
6 2.78 4.44 21.11 m 6
7 16.11 0.56 7.78 m 7
8 2.22 6.67 7.78 m 8
9 17.22 -1.11 0.56 m 9
Here's a single line solution with apply and a conversion function.
def convert_to_celsius(f):
    return 5.0 / 9.0 * (f - 32)

df[['Column A','Column B']] = df[['Column A','Column B']].apply(convert_to_celsius).round(2)
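If the same convert-and-round step is needed in several places, it can be wrapped once in a small helper; a sketch (the function name and column list are only illustrative):
def fahrenheit_to_celsius(df, cols, decimals=2):
    # convert the given Fahrenheit columns to Celsius and round in one pass
    out = df.copy()
    out[cols] = ((out[cols] - 32) * 5 / 9).round(decimals)
    return out

df = fahrenheit_to_celsius(df, ['maximum temperature', 'minimum temperature', 'average temperature'])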
I have a dataframe containing YouTube video view counts, and I want to scale these values to the range 1-10.
Below is a sample of what the values look like. How do I normalize them to the range 1-10, or is there a more efficient way to do this?
rating
4394029
274358
473691
282858
703750
255967
3298456
136643
796896
2932
220661
48688
4661584
2526119
332176
7189818
322896
188162
157437
1153128
788310
1307902
One possibility is scaling with the maximum value:
1 + df / df.max() * 9
rating
0 6.500315
1 1.343433
2 1.592952
3 1.354073
4 1.880933
5 1.320412
6 5.128909
7 1.171046
8 1.997531
9 1.003670
10 1.276217
11 1.060946
12 6.835232
13 4.162121
14 1.415808
15 10.000000
16 1.404192
17 1.235536
18 1.197075
19 2.443451
20 1.986783
21 2.637193
Similar solution by Wen (now deleted):
1 + (df - df.min()) * 9 / (df.max() - df.min())
rating
0 6.498887
1 1.339902
2 1.589522
3 1.350546
4 1.877621
5 1.316871
6 5.126922
7 1.167444
8 1.994266
9 1.000000
10 1.272658
11 1.057299
12 6.833941
13 4.159739
14 1.412306
15 10.000000
16 1.400685
17 1.231960
18 1.193484
19 2.440368
20 1.983514
21 2.634189
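If scikit-learn is available, the same min-max scaling to the 1-10 range can be done with MinMaxScaler; a sketch (the rating_scaled column name is just illustrative):
from sklearn.preprocessing import MinMaxScaler

# fit the scaler to the rating column and map it onto the 1-10 range
scaler = MinMaxScaler(feature_range=(1, 10))
df['rating_scaled'] = scaler.fit_transform(df[['rating']]).ravel()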