Extracting characters from a mixed format csv file using Pandas

Extracting characters from a mixed format csv file using Pandas - python

So, I have a .csv file which looks like this:
station_id year january february ... december
210018 1916 nodata 221 417a
210018 1917 17b 98 44
....
210252 1910 54e 110 nodata
210252 1911 99d 24i 77
...
I need to extract letters from a to i (a-i) from the data. These letters mean numbers of missing days per month: a means 1 day and i means 9 missing days. Right now I don't care about 'nodata' cells. After extracting letters from data cells, I want to calculate total number of missing days per month:
station_id year january february ... december N_missingdays
210018 1916 nodata 221 417 1(a)
210018 1917 17 98 44 11(b+i)
....
210252 1910 54 110 nodata 8(e+c)
210252 1911 99 24 77 13(d+i)
Probably, the best way to do it, is to create a dictionary with station_id, year and number of missing days. Here is what I was trying to do:
with open('filepath') as file:
file_reader = reader(file)
for i,row in enumerate(file_reader):
for j,item in enumerate(row):
if item[len(item)-1]=='a':
file_reader[i][j]=''
print file_reader
But this function just deleting letters from the file and it doesn't work correctly. I don't know exactly how to extract letters from the .csv file and calculate their meaning.
The other thing I was trying to do is this:
with open('filepath') as file:
file_reader = reader(file)
next(file_reader)
letters_dict={}
for row in file_reader:
station_id,year,months = row[1],row[2],row[4:]
letters_list[station_id,year] = months.count('[0-9][a]') + ... + months.count('[0-9][i]') + letters_dict.get(year, 0) + letters_dict.get(station_id,0)
But this code writes in a dictionary only zeroes.

Related

Rearranging data from a table in python to correlate columns

I got a table that look like this:
code
year
month
Value A
Value B
1
2020
1
120
100
1
2020
2
130
90
1
2020
3
90
89
1
2020
4
67
65
...
...
...
...
...
100
2020
10
90
90
100
2020
11
115
100
100
2020
12
150
135
I would like to know if there's a way to rearrange the data to find the correlation between A and B for every distinct code.
What I'm thinking is, for example, getting an array for every code, like:
[(A1,A2,A3...,A12),(B1,B2,B3...,B12)]
where A and B is the values for the respective month, and then I could see the correlation between these two columns. Is there a way to make this dynamic?

IIUC, you don't need to re-arrange to get the correlation for each "code". Instead, try with groupby:
>>> df.groupby("code").apply(lambda x: x["Value A"].corr(x["Value B"]))
code
1 0.830163
100 0.977093
dtype: float64

Iterate over specific rows, sum results and store in new row

I have a DataFrame in which I have already defined rows to be summed up and store the results in a new row.
For example in Year 1990:
Category
A
B
C
D
Year
E
147
78
476
531
1990
F
914
356
337
781
1990
G
117
874
15
69
1990
H
45
682
247
65
1990
I
20
255
465
19
1990
Here, the rows G - H should be summed up and the results stored in a new row. The same categories repeat every year from 1990 - 2019
I have already tried it with .iloc e.g. [4:8], [50:54] [96:100] and so on, but with iloc I can not specify multiple index. I can't manage to make a loop over the single years.
Is there a way to sum the values in categories (G-H) for each year (1990 -2019)?

I'm not sure the multiple index what you mean.
It usually appear after some group and aggregate function.
At your table, it looks just multiple column
So, if I understand correctly.
Here a complete code to show how to use the multiple condition of DataFrame
import io
import pandas as pd
data = """Category A B C D Year
E 147 78 476 531 1990
F 914 356 337 781 1990
G 117 874 15 69 1990
H 45 682 247 65 1990
I 20 255 465 19 1990"""
table = pd.read_csv(io.StringIO(data), delimiter="\t")
years = table["Year"].unique()
for year in years:
row = table[((table["Category"] == "G") | (table["Category"] == "H")) & (table["Year"] == year)]
row = row[["A", "B", "C", "D"]].sum()
row["Category"], row["Year"] = "sum", year
table = table.append(row, ignore_index=True)

If you are only interested in G/H, you can slice with isin combined with boolean indexing, then sum:
df[df['Category'].isin(['G', 'H'])].sum()
output:
Category GH
A 162
B 1556
C 262
D 134
Year 3980
dtype: object
NB. note here the side effect of sum that combines the two "G"/"H" strings into one "GH".
Or, better, set Category as index and slice with loc:
df.set_index('Category').loc[['G', 'H']].sum()
output:
A 162
B 1556
C 262
D 134
Year 3980
dtype: int64

Adding a matching value from 3 different DataFrames, not the entire column Python

I have three different DateFrames (df2019, df2020, and df2021) and the all have the same columns(here are a few) with some overlapping 'BrandID':
BrandID StockedOutDays Profit SalesQuantity
243 01-02760 120 516452.76 64476
138 01-01737 96 603900.0 80520
166 01-02018 125 306796.8 52896
141 01-01770 109 297258.6 39372
965 02-35464 128 214039.2 24240
385 01-03857 92 326255.16 30954
242 01-02757 73 393866.4 67908
What I'm trying to do is add the value from one column for a specific BrandID from each of the 3 DataFrame's. In my specific case, I'd like to add the value of 'Sales Quantity' for 'BrandID' = 01-02757 from df2019, df2020 and df2021 and get a line I can run to see a single number.
I've searched around and tried a bunch of different things, but am stuck. Please help, thank you!
EDIT *** I'm looking for something like this I think, I just don't know how to sum them all together:
df2021.set_index('BrandID',inplace=True)
df2020.set_index('BrandID',inplace=True)
df2019.set_index('BrandID',inplace=True)
df2021.loc['01-02757']['SalesQuantity']+df2020.loc['01-02757']['SalesQuantity']+
df2019.loc['01-02757']['SalesQuantity']

import pandas as pd
df2019 = pd.DataFrame([{"BrandID":"01-02760", "StockedOutDays":120, "Profit":516452.76, "SalesQuantity":64476},
{"BrandID":"01-01737", "StockedOutDays":96, "Profit":603900.0, "SalesQuantity":80520}])
df2020 = pd.DataFrame([{"BrandID":"01-02760", "StockedOutDays":123, "Profit":76481.76, "SalesQuantity":2457},
{"BrandID":"01-01737", "StockedOutDays":27, "Profit":203014.0, "SalesQuantity":15648}])
df2019["year"] = 2019
df2020["year"] = 2020
df = pd.DataFrame.append(df2019, df2020)
df_sum = df.groupby("BrandID").agg("sum").drop("year",axis=1)
print(df)
print(df_sum)
df:
BrandID StockedOutDays Profit SalesQuantity year
0 01-02760 120 516452.76 64476 2019
1 01-01737 96 603900.00 80520 2019
0 01-02760 123 76481.76 2457 2020
1 01-01737 27 203014.00 15648 2020
df_sum:
StockedOutDays Profit SalesQuantity
BrandID
01-01737 123 806914.00 96168
01-02760 243 592934.52 66933

find minimum from text file

I am new to Python and I am trying to figure out how to get my program to find the minimum after it reads specific columns and each rows from the file. Can anyone help me with this?
This is how an example of how my text file looks like:
05/01 80 2002 5 1966 19 2000 45 2010
06/22 77 1980 4 1945 22 1986 58 2000
---------------------------------------------------------------------------
Day Max Year Min Year Max Year Min Year
---------------------------------------------------------------------------
08/01 79 2002 8 1981 28 1900 54 1988
08/02 79 1989 5 1971 31 1994 60 1998
This is my code(below) that I have so far.
def main ()
file = open ('file.txt', 'r')
for num in file.read().splitlines():
i = num.split()
if len(i) > 5:
print('Day:{}\n' .format(i[0]))
print('Year:{}\n' .format(i[2]))
print('Lowest Temperature:{}'.format(i[1]))
This is the output I get from my code. (it prints out text as well) :
Day:Day
Year:Year
Lowest Temperature:Max
Day: 3/11
Year:1920
Lowest Temperature:78
Day:11/02
Year:1974
Lowest Temperature:80
Day:11/03
Year:1974
Lowest Temperature:74
I am trying to find the lowest temperature from my text file and print out the day and the year associated with that temp. My output should look like this. Thanks to everyone who is willing to help me with this.
Day:10/02
Year:1994
Lowest Temperature:55

You can use your current method to read the file into lines, then split each line into individual columns.
You can then make use of min(), using the column containing the minimum temperature (in this case column 3) as the key to min().
with open('test.txt') as f:
data = f.read().splitlines()
data = [i.split() for i in data if any(j.isdigit() for j in i)]
data = min(data, key=lambda x: int(x[3]))
print('Day: {}\nYear: {}\nLowest Temperature: {}' .format(data[0], data[2], data[3]))
Output for your sample file:
Day: 06/22
Year: 1980
Lowest Temperature: 4

Group rows in a CSV by blocks of 25

I have a csv file with 2 columns, representing a distribution of items per year which looks like this:
A B
1900 10
1901 2
1903 5
1908 8
1910 25
1925 3
1926 4
1928 1
1950 10
etc, about 15000 lines.
When making a distribution diagram based on this data, it's too many points on an axe, not very pretty. I want to group rows by blocks of 25 years, so that at the end I would have less points at the axe.
So, for example, from 1900 till 1925 I would have a sum of produced items, 1 row in A column and 1 row in B column:
1925 53
1950 15
So far I only figured how to convert the data in csv file to int:
o=open('/dates_dist.csv', 'rU')
mydata = csv.reader(o)
def int_wrapper(mydata):
for v in reader:
yield map(int, v)
reader = int_wrapper(mydata)
Can't find how to do it further...

You could use itertools.groupby:
import itertools as IT
import csv
def int_wrapper(mydata):
for v in mydata:
yield map(int, v)
with open('data', 'rU') as o:
mydata = csv.reader(o)
header = next(mydata)
reader = int_wrapper(mydata)
for key, group in IT.groupby(reader, lambda row: (row[0]-1)//25+1):
year = key*25
total = sum(row[1] for row in group)
print(year, total)
yields
(1900, 10)
(1925, 43)
(1950, 15)
Note that 1900 to 1925 (inclusive) spans 26 years, not 25. So
if you want to group 25 years, given the way you are reporting the totals, you probably want the half-open interval (1900, 1925].
The expression row[0]//25 takes the year and integer divides by 25.
This number will be the same for all numbers in the range [1900, 1925).
To make the range half-open on the left, subtract and add 1: (row[0]-1)//25+1.

Here is my approach . Its definitely not the most engaging python code, but could be a way to achieve the desired output.
if __name__ == '__main__':
o=open('dates_dist.csv', 'rU')
lines = o.read().split("\n") # Create a list having each line of the file
out_dict = {}
curr_date = 0;
curr_count = 0
chunk_sz = 25; #years
if len(lines) > 0:
line_split = lines[0].split(",")
start_year = int(line_split[0])
curr_count = 0
# Iterate over each line of the file
for line in lines:
# Split at comma to get the year and the count.
# line_split[0] will be the year and line_split[1] will be the count.
line_split = line.split(",")
curr_year = int(line_split[0])
time_delta = curr_year-start_year
if time_delta<chunk_sz or time_delta == chunk_sz:
curr_count = curr_count + int(line_split[1])
else:
out_dict[start_year+chunk_sz] = curr_count
start_year = start_year+chunk_sz
curr_count = int(line_split[1])
#print curr_year , curr_count
out_dict[start_year+chunk_sz] = curr_count
print out_dict

You could create a dummy column and group by it after doing some integer division:
df['temp'] = df['A'] // 25
>>> df
A B temp
0 1900 10 76
1 1901 2 76
2 1903 5 76
3 1908 8 76
4 1910 25 76
5 1925 3 77
6 1926 4 77
7 1928 1 77
8 1950 10 78
>>> df.groupby('temp').sum()
A B
temp
76 9522 50
77 5779 8
78 1950 10
My numbers are slightly different from yours since I am technically grouping from 1900-1924, 1925-1949, and 1950-1974, but the idea is the same.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Extracting characters from a mixed format csv file using Pandas - python

Related

Rearranging data from a table in python to correlate columns

Iterate over specific rows, sum results and store in new row

Adding a matching value from 3 different DataFrames, not the entire column Python

find minimum from text file

Group rows in a CSV by blocks of 25

Categories

Resources